
ericphanson (Owner)

I tried to see if AllocArrays would work with Lux/Boltz, since it seems to be working for ObjectDetector. There, the only issue I hit was a performance problem: mixing UnsafeArrays with plain matrices caused matmuls to dispatch to generic matmul instead of BLAS. This is kinda similar to the issue from JuliaLang/julia#57799 (though technically quite different), and I came up with a better workaround this time: we can adapt the function within the with_allocator block to homogenize the array types. As long as adapting is fast relative to the runtime, it's kinda fine.
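To illustrate the homogenization idea: here's a minimal, hypothetical sketch (the `homogenize` function below is an illustrative stand-in, not this PR's actual Adapt-based code) of walking a parameter NamedTuple and materializing every array leaf into a plain `Array`, so all arguments end up with one concrete array type and matmuls hit a single BLAS-backed method.

```julia
# Hypothetical sketch: recursively materialize array leaves into plain Arrays
# so that every argument shares one concrete array type.
homogenize(x::AbstractArray) = convert(Array, x)
homogenize(x::NamedTuple) = map(homogenize, x)
homogenize(x::Tuple) = map(homogenize, x)
homogenize(x) = x  # leave non-array leaves (scalars, activation functions, ...) alone

ps = (W = view(rand(Float32, 4, 4), 1:2, 1:2),  # an array view, not a Matrix
      b = rand(Float32, 2),
      σ = tanh)
ps2 = homogenize(ps)
@assert ps2.W isa Matrix{Float32}  # the view was materialized
@assert ps2.σ === tanh             # non-array leaf passed through
```

In the PR the same effect is achieved with Adapt's recursive structure traversal rather than hand-written methods like these.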

However, this means the function we evaluate does not persist, so it might not work well for ObjectDetector, which updates the model struct even on the forward pass via its W parameter. So here I have the terrible name CachedUpdatelessFunction, where "Cached" means it reuses allocations (not values), "Updateless" means it won't persist updates to the model, and "Function" means a generic callable (it doesn't have to be an ML model).
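The "updateless" semantics can be sketched in isolation. This is a conceptual stand-in, not AllocArrays' implementation: `deepcopy` here plays the role of adapting the model into the allocator's arrays, and the wrapper name is hypothetical. The point is that any mutation of the temporary adapted copy is discarded after the call.

```julia
# Conceptual sketch of "updateless" call semantics (hypothetical type; in the
# PR, adaptation into the allocator's arrays plays the role of deepcopy here).
struct UpdatelessFunction{F}
    f::F
end

function (uf::UpdatelessFunction)(args...)
    f_tmp = deepcopy(uf.f)   # temporary copy of the model; mutations won't persist
    out = f_tmp(args...)     # run on the temporary copy
    return deepcopy(out)     # copy results out so nothing aliases temporary buffers
end

# A callable that mutates its own state on every forward pass,
# like ObjectDetector's W update:
mutable struct Counter
    n::Int
end
(c::Counter)(x) = (c.n += 1; x + c.n)

c = Counter(0)
uc = UpdatelessFunction(c)
uc(1)             # the copy's counter increments, then the copy is discarded
@assert c.n == 0  # the wrapped model was not updated
```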

With this, I get

```julia
using Boltz, Lux, JLD2, AllocArrays, Adapt, Random, BenchmarkTools

model = Vision.VGG(13; pretrained=true)
model_aa = AllocArrays.CachedUpdatelessFunction(model)

batch = rand(Float32, 224, 224, 3, 20)
ps, st = Lux.setup(Random.default_rng(), model)

println("Benchmark with Array:")
display(@benchmark $model($batch, $ps, $(Lux.testmode(st))))

println("Benchmark with AllocArray:")
display(@benchmark $model_aa($batch, $ps, $(Lux.testmode(st))))
```

yielding

```
julia> include("script.jl");
Benchmark with Array:
BenchmarkTools.Trial: 3 samples with 1 evaluation per sample.
 Range (min … max):  1.914 s …  1.970 s  ┊ GC (min … max): 4.06% … 3.62%
 Time  (median):     1.938 s             ┊ GC (median):    3.78%
 Time  (mean ± σ):   1.940 s ± 28.213 ms ┊ GC (mean ± σ):  3.82% ± 0.22%

  █                       █                               █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.91 s         Histogram: frequency by time        1.97 s <

 Memory estimate: 2.33 GiB, allocs estimate: 476.
Benchmark with AllocArray:
BenchmarkTools.Trial: 3 samples with 1 evaluation per sample.
 Range (min … max):  1.967 s …  1.991 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.977 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.978 s ± 12.323 ms ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                      █                                █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.97 s         Histogram: frequency by time        1.99 s <

 Memory estimate: 152.69 KiB, allocs estimate: 759.
```

So we were able to eliminate GC and most allocations, though possibly at a small runtime cost. Maybe we are still not hitting some optimized path.

This feels a bit clunky and hard to describe currently, and it moves Adapt from an extension to a regular dep, so I'm not sure we want this, but I thought I'd put it up.
