
Fine-grained fast-math flags #1991

Open
lcw opened this issue Jul 5, 2023 · 11 comments · May be fixed by #2037
Labels: cuda kernels (Stuff about writing CUDA kernels.), enhancement (New feature or request), upstream (Somebody else's problem.)

Comments

@lcw
Contributor

lcw commented Jul 5, 2023

Is your feature request related to a problem? Please describe.

To get kernel performance matching clang, we have had to add fast-math flags such as contract (which clang and nvcc enable by default). Currently, we do this with an ugly hack; see for example

# HACK: module-local versions of core arithmetic; needed to get FMA
for (jlf, f) in zip((:+, :*, :-), (:add, :mul, :sub))
    for (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))
        ir = """
            %x = f$f contract nsz $llvmT %0, %1
            ret $llvmT %x
            """
        @eval begin
            # the @pure is necessary so that we can constant propagate.
            @inline Base.@pure function $jlf(a::$T, b::$T)
                Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
            end
        end
    end
    @eval function $jlf(args...)
        Base.$jlf(args...)
    end
end

let (jlf, f) = (:div_arcp, :div)
    for (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))
        ir = """
            %x = f$f fast $llvmT %0, %1
            ret $llvmT %x
            """
        @eval begin
            # the @pure is necessary so that we can constant propagate.
            @inline Base.@pure function $jlf(a::$T, b::$T)
                Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
            end
        end
    end
    @eval function $jlf(args...)
        Base.$jlf(args...)
    end
end

rcp(x) = div_arcp(one(x), x) # still leads to rcp.rn which is also a function call

Describe the solution you'd like

I would like a macro like @fastmath that has fine-grained control over the fast-math flags.

Describe alternatives you've considered

KernelAbstractions used to do this with https://github.com/JuliaLabs/Cassette.jl, and other people use macros (although that opens up less optimization and is thus less desirable). I don't know if https://github.com/JuliaDebug/CassetteOverlay.jl can be used with kernels, but it might be a possible way to implement this.

It would be nice if this functionality eventually got added to base julia.

@lcw lcw added the enhancement label Jul 5, 2023
@maleadt
Member

maleadt commented Jul 6, 2023

It would be nice if this functionality eventually got added to base julia.

I agree, so better file this on the Julia repository?

@lcw
Contributor Author

lcw commented Jul 6, 2023

Looks like there already is at least one JuliaLang/julia#49890.

@maleadt
Member

maleadt commented Jul 6, 2023

I think we can close this issue then?

@lcw
Contributor Author

lcw commented Jul 6, 2023

Sure.

@lcw lcw closed this as completed Jul 6, 2023
@vchuravy
Member

vchuravy commented Jul 7, 2023

I was thinking we could do @cuda math=(:contract, :reassoc) and then use an overlay table to switch the implementation.
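
For context, a minimal sketch of what the overlay-table half of that could look like, using the method-table machinery in Base.Experimental. ContractTable is a made-up name here, not existing CUDA.jl API, and the kernel compiler would still need to be pointed at the table (which is what the @cuda math=... keyword would do):

using Base.Experimental: @MethodTable, @overlay

# Hypothetical overlay table with contracting versions of Base arithmetic.
# Nothing here is existing CUDA.jl API: the kernel compiler would have to be
# told to resolve device calls against this table when math=(:contract,) is
# requested.
@MethodTable ContractTable

for (f, op) in ((:+, "fadd"), (:*, "fmul"), (:-, "fsub")),
    (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))

    ir = """
        %x = $op contract $llvmT %0, %1
        ret $llvmT %x
        """
    @eval @overlay ContractTable function Base.$f(a::$T, b::$T)
        Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
    end
end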

@lcw lcw reopened this Jul 7, 2023
@lcw
Contributor Author

lcw commented Jul 7, 2023

I like that idea. So all of the code in the kernel (even within function calls) would use contract and reassoc?

@maleadt maleadt added the cuda kernels label Jul 7, 2023
@vchuravy
Member

vchuravy commented Jul 7, 2023

Yeah, kinda inspired by JuliaLang/julia#50239, I think we could solve this with stacked OverlayMethodTables.
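
To spell out what stacking would mean, here is a purely conceptual sketch with made-up names (not GPUCompiler's actual types): resolution consults the outermost overlay first and falls through to the inner tables, ending at the base table.

# Conceptual sketch only: a "table" is modelled as a Dict from function name to
# its override, and resolution walks from the outermost overlay inwards.
struct StackedTable{Outer,Inner}
    outer::Outer
    inner::Inner
end

resolve(table::Dict, f) = get(table, f, nothing)   # leaf table: direct lookup

function resolve(stack::StackedTable, f)
    m = resolve(stack.outer, f)
    m === nothing ? resolve(stack.inner, f) : m
end

# E.g. StackedTable(fasttrig, StackedTable(contract, base)): fast-trig overrides
# win, then contract overrides, then the base definitions.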

@maleadt
Member

maleadt commented Aug 14, 2023

I think it would be better to prototype this in an external package, and have CUDA.jl use that package's overlay table. That way the functionality wouldn't be locked into the CUDA.jl ecosystem either.

@maleadt maleadt added the upstream label Aug 14, 2023
@lcw
Contributor Author

lcw commented Aug 16, 2023

Something akin to https://github.com/JuliaSIMD/LLVMLoopInfo.jl? That would be great. Is the idea to use CassetteOverlay to create some standard passes for each fast-math flag and then use those in the kernels via macros? I am not sure how to stack these for combining fast-math flags.

@vchuravy
Member

vchuravy commented Aug 16, 2023

No more like https://github.com/vchuravy/FastmathOverlay.jl

I don't have a good solution for combining flags... yet.

@vchuravy
Copy link
Member

Okay, #2037 is a prototype of that idea. Now that we know it is feasible, we have to decide if we like it.

Composition is possible for some things; for others it is tedious.

As an example, say you want to opt into :contract on all floating-point ops and we add a speculative :fast_trig.
That should work fine, since we can form StackedMethodTable(FastTrig, StackedMethodTable(Contract, CUDA)).

Sadly, we can't use the same approach for composing :contract and :reassoc, since the definitions in the outer table will shadow the definitions in the inner one. This also means that for :FastTrig we may want something like :FastTrigCUDA, since we would otherwise shadow the CUDA definitions.

Right now the only idea I have for :contract & :reassoc is the tedious solution of manually (or through metaprogramming) creating a method table Contract×Reassoc that defines the combinations we want.
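
For what it's worth, a sketch of how such a product table could be generated through metaprogramming, assuming the same Base.Experimental overlay machinery as above; the table name and the set of overridden operations are made up for illustration:

using Base.Experimental: @MethodTable, @overlay

# Hypothetical Contract×Reassoc table: every definition carries both flags.
@MethodTable ContractReassocTable

for (f, op) in ((:+, "fadd"), (:*, "fmul")),
    (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))

    ir = """
        %x = $op contract reassoc $llvmT %0, %1
        ret $llvmT %x
        """
    @eval @overlay ContractReassocTable function Base.$f(a::$T, b::$T)
        Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
    end
end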
