Conversation

@mhauru
Member

@mhauru mhauru commented Jul 31, 2025

Work in progress, though most tests now pass. Current known issues are calls to varargs functions and a higher allocation count than on v1.11.

@mhauru mhauru force-pushed the mhauru/julia1.12 branch from 2dfd832 to 0755577 Compare July 31, 2025 10:54
@github-actions

github-actions bot commented Aug 5, 2025

Libtask.jl documentation for PR #196 is available at:
https://TuringLang.github.io/Libtask.jl/previews/PR196/

@mhauru
Member Author

mhauru commented Aug 7, 2025

The test suite now passes except for the test cases called "nested with args (dynamic)" and "nested with args (dynamic + used)", which segfault due to JuliaLang/julia#59222. Waiting on a response there to see whether that's a Julia bug or something we need to adapt to.

@yebai
Member

yebai commented Aug 7, 2025

JuliaLang/julia#59222

These tests look slightly weird. They might be violating opaque closure assumptions, e.g. dynamic dispatch of g(xs...) depends on inputs to the opaque closure itself. Do we need them here?

@yebai
Member

yebai commented Aug 7, 2025

This works, though it doesn't answer why the example in JuliaLang/julia#59222 fails.

julia> g(xs::Tuple{Vararg{T}}) where T = 1

julia> (t::Tuple{typeof(g)})(xs::Tuple{Vararg{T}}) where T = t[1](xs)

julia> ir = Base.code_ircode_by_type(Tuple{Tuple{typeof(g)}, Tuple{Vararg{Symbol}}})[1][1]

julia> ir.argtypes[1] = Tuple{typeof(g)}

julia> oc = Core.OpaqueClosure(ir, g; isva=true, do_compile=true)

julia> oc(:a, :b)  # works on 1.12 and 1.11

EDIT: I suspect that g(xs...) is type unstable, though I still don't understand why that leads to a segfault on Julia 1.12 despite working fine on Julia 1.11.

@mhauru
Member Author

mhauru commented Aug 8, 2025

The case where this comes up is when, in a TapedTask, you have a dynamic call to a varargs function that might contain a produce statement. In that case Libtask.DynamicCallable calls it with the explicit arguments known at runtime, and hence the code_ircode_by_type call sees the concrete types rather than a Vararg type. I could maybe work around that by hacking ir.argtypes, but I couldn't figure out a clean way to do it, and the segfault looks to me like either a Julia bug or a change in how OCs work that I need to understand.
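To illustrate the mismatch (a sketch with hypothetical variable names, not the actual Libtask internals): at a dynamic call site only the concrete runtime arguments are in hand, so the signature the IR is derived from is fully concrete rather than a Vararg signature:

```julia
g(xs...) = 1

# At a dynamic call site only the concrete runtime arguments are known,
# so the signature used to derive IR is fully concrete.
args = (:a, :b)
sig_concrete = Tuple{typeof(g), map(typeof, args)...}  # Tuple{typeof(g), Symbol, Symbol}

# The IR derived from this signature is specialised to two Symbol
# arguments; making it usable as a vararg opaque closure would require
# hand-editing ir.argtypes.
ir = Base.code_ircode_by_type(sig_concrete)[1][1]
```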

Your comment about the callable being in the captures is good though, because it made me realise that I don't need that to replicate the problem. This, too, segfaults:

module MWE

g(xs...) = 1

# Derive IR for a call to g with concrete argument types, then replace the
# function-argument slot with the (empty) captures tuple.
ir = Base.code_ircode_by_type(Tuple{typeof(g), Symbol, Symbol})[1][1]
ir.argtypes[1] = Tuple{}

oc = Core.OpaqueClosure(ir; isva=true, do_compile=true)
oc(:a, :b)  # segfault on 1.12

end
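As a baseline when bisecting this, a vararg opaque closure built with the experimental macro rather than from hand-edited IR may be useful (a sketch, not part of the PR; `Base.Experimental.@opaque` is the macro-based constructor):

```julia
# Baseline: a vararg opaque closure constructed without touching IR,
# for comparison against the IR-derived one above.
oc2 = Base.Experimental.@opaque (xs...) -> length(xs)
oc2(:a, :b)  # length of the splatted argument tuple
```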

@yebai
Member

yebai commented Sep 18, 2025

Extra allocations might be due to: JuliaLang/julia#58780

@mhauru
Member Author

mhauru commented Oct 9, 2025

With the workaround from JuliaLang/julia#59222 (comment) in place, tests now pass. However, something absolutely horrible has happened to performance and allocations. The benchmark suite is failing because it hasn't been updated for the new version of Turing (this happens on main too, so it's not a problem of this PR), but the benchmarks that complete before the crash show the following (run on my laptop):

On v1.11.7:

benchmarking rosenbrock...
  Run Original Function:  186.625 μs (24 allocations: 6.25 MiB)
  Run TapedTask: #produce=1;   7.161 ms (299155 allocations: 10.83 MiB)
benchmarking ackley...
  Run Original Function:  1.161 ms (0 allocations: 0 bytes)
  Run TapedTask: #produce=100000;   29.137 ms (899584 allocations: 21.36 MiB)
benchmarking matrix_test...
  Run Original Function:  94.958 μs (18 allocations: 576.47 KiB)
  Run TapedTask: #produce=1;   453.541 μs (546 allocations: 594.52 KiB)
benchmarking neural_net...
  Run Original Function:  444.444 ns (8 allocations: 576 bytes)
  Run TapedTask: #produce=1;   2.940 μs (54 allocations: 2.17 KiB)

On v1.12.0:

benchmarking rosenbrock...
  Run Original Function:  182.625 μs (24 allocations: 6.25 MiB)
  Run TapedTask: #produce=1;   3.261 s (15156390 allocations: 274.15 MiB)
benchmarking ackley...
  Run Original Function:  1.151 ms (0 allocations: 0 bytes)
  Run TapedTask: #produce=100000;   914.884 ms (4394003 allocations: 80.78 MiB)
benchmarking matrix_test...
  Run Original Function:  95.458 μs (18 allocations: 576.47 KiB)
  Run TapedTask: #produce=1;   137.134 ms (266275 allocations: 6.93 MiB)
benchmarking neural_net...
  Run Original Function:  437.712 ns (8 allocations: 576 bytes)
  Run TapedTask: #produce=1;   40.125 μs (226 allocations: 7.52 KiB)

That's a 20-1000 fold slowdown. This makes Libtask essentially useless on v1.12. These benchmarks also pass, and show similar results, without the aforementioned workaround. Note that the fix for JuliaLang/julia#58780 is in 1.12.0, so that should no longer be the issue.

I'll see if I can boil this down to a simple example and use that to ask for advice from Julia devs.

@yebai
Member

yebai commented Oct 9, 2025

This might be the cause: chalk-lab/Mooncake.jl#714 (comment)

@mhauru
Member Author

mhauru commented Oct 9, 2025

Here's a quick snippet to check this:

module MWE

using Libtask, BenchmarkTools

function f(x)
    i = x .+ 1
    j = x .- 1
    ret = (i - j .^ 2)
    produce(ret)
    return ret
end

function wrap(x)
    tt = TapedTask(nothing, f, x)
    consume(tt)
    return nothing
end

@btime wrap(randn(100))

end

On 1.11.7: 32.166 μs (394 allocations: 17.55 KiB); on 1.12.0: 1.576 ms (4207 allocations: 115.39 KiB).

Since @serenity4 is working on a fix that might help, I'll wait for that before putting more effort into figuring out what is going on here.

@penelopeysm
Member

Performance is still bad on 1.12.1.

@mhauru
Member Author

mhauru commented Oct 23, 2025

After copying over some fantastic work by @serenity4 in chalk-lab/Mooncake.jl#714, which runs some optimisation passes on the opaque closures, we are back in business. Benchmarks on v1.11 (copied from above):

benchmarking rosenbrock...
  Run Original Function:  186.625 μs (24 allocations: 6.25 MiB)
  Run TapedTask: #produce=1;   7.161 ms (299155 allocations: 10.83 MiB)
benchmarking ackley...
  Run Original Function:  1.161 ms (0 allocations: 0 bytes)
  Run TapedTask: #produce=100000;   29.137 ms (899584 allocations: 21.36 MiB)
benchmarking matrix_test...
  Run Original Function:  94.958 μs (18 allocations: 576.47 KiB)
  Run TapedTask: #produce=1;   453.541 μs (546 allocations: 594.52 KiB)
benchmarking neural_net...
  Run Original Function:  444.444 ns (8 allocations: 576 bytes)
  Run TapedTask: #produce=1;   2.940 μs (54 allocations: 2.17 KiB)

and now on v1.12.1:

benchmarking rosenbrock...
  Run Original Function:  191.750 μs (24 allocations: 6.25 MiB)
  Run TapedTask: #produce=1;   2.051 ms (679 allocations: 6.27 MiB)
benchmarking ackley...
  Run Original Function:  1.161 ms (0 allocations: 0 bytes)
  Run TapedTask: #produce=100000;   63.696 ms (699582 allocations: 18.31 MiB)
benchmarking matrix_test...
  Run Original Function:  100.542 μs (18 allocations: 576.47 KiB)
  Run TapedTask: #produce=1;   313.458 μs (529 allocations: 594.06 KiB)
benchmarking neural_net...
  Run Original Function:  444.657 ns (8 allocations: 576 bytes)
  Run TapedTask: #produce=1;   15.250 μs (168 allocations: 6.27 KiB)

Some speed-ups, some slowdowns, but nothing like the orders of magnitude we were seeing.

@mhauru mhauru marked this pull request as ready for review October 23, 2025 12:13
@mhauru
Member Author

mhauru commented Oct 23, 2025

We may still end up taking more code from chalk-lab/Mooncake.jl#714, or maybe moving some of the OC optimisation stuff to MistyClosures to reduce code duplication. However, while all that is being worked on, I would propose merging and releasing this now, since it's definitely a vast improvement over Libtask just flat out not working on v1.12.

The benchmarks are failing on the main branch too. I suspect IntegrationTest / Turing.jl will fail because it hasn't been updated to keep pace with Turing, but I think that is a separate issue.

Member

@sunxd3 sunxd3 left a comment


This all looks reasonable, although I don't think I fully grasp the consequences of the code changes and the deeper reasons connected to the Julia 1.11 -> 1.12 transition.

That being said, the tests are passing and the change alleviates the performance issue, which was really blocking. So this is a merge for me.

@mhauru
Member Author

mhauru commented Oct 24, 2025

I also ran parts of the Turing.jl test suite locally with this version of Libtask, and didn't encounter any problems, so I think we are fine. Thanks @sunxd3, will merge and release.

@mhauru mhauru merged commit 7fbac1c into main Oct 24, 2025
12 of 14 checks passed
@mhauru mhauru deleted the mhauru/julia1.12 branch October 24, 2025 14:06