Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Try to retain entry weight during profile synthesis #111971

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

amanasifkhalid
Copy link
Member

Part of #107749. Prerequisite to #111915. Regardless of the profile synthesis option used, we ought to maintain the method's entry weight, which is computed by summing all non-flow weight into the entry block. Ideally, we'd use fgCalledCount here, but this isn't computed until after morph, and we need to tolerate the existence of multiple entry blocks for methods with OSR pre-morph.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 29, 2025
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid
Copy link
Member Author

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs show numerous methods where we can compute a more precise call count, rather than falling back to BB_UNITY_WEIGHT. Thanks!

@AndyAyersMS
Copy link
Member

There's kind of a chicken and egg problem here, because we're about to recompute the weights of all the blocks that are preds of the entry block, so relying on the existing weights of those blocks seems a bit odd.

Can you post a simple(-ish) non-OSR example where this changes things?

If the entry is a loop head (not sure that is possible anymore with ominpresent scratch BB) there is a $C_p$ and you could possibly rely on that instead, as it will have already computed how much flow comes back to the entry from its preds.

@jakobbotsch
Copy link
Member

I had #110693 where I tried to compute fgCalledCount during profile incorporation, but IIRC I ended up with some massive diffs that I did not look into.

@amanasifkhalid
Copy link
Member Author

If the entry is a loop head (not sure that is possible anymore with ominpresent scratch BB) there is a $C_p$ and you could possibly rely on that instead, as it will have already computed how much flow comes back to the entry from its preds.

Thanks for pointing this out; this seems better than relying on the old weights. One thing I notice with this approach is ComputeCyclicProbabilities sets all loop headers' weights to 1.0, so we end up deriving an entry weight <= 1.0, giving the impression that methods beginning with loops are infrequently called (though I suppose relying on something canned like BB_UNITY_WEIGHT isn't necessarily better). Is an accurate call count unknowable in these cases? I suppose this isn't all that limiting considering the few places we use fgCalledCount to drive decisions.

Can you post a simple(-ish) non-OSR example where this changes things?

Sure. For Microsoft.CodeAnalysis.CSharp.CSharpCompilation+UsingsFromOptionsAndDiagnostics:Complete in benchmarks.run_pgo, the baseline JIT bails when it sees the backedge to fgFirstBB, and uses BB_UNITY_WEIGHT as the entry weight:

BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  2       BB10                100.08 100 [000..016)-> BB08(0.999),BB02(0.000766)    ( cond )                     IBC bwd bwd-target
BB02 [0001]  1       BB01                  0.08   0 [016..01B)-> BB05(0.5),BB03(0.5)     ( cond )                     IBC bwd
BB03 [0002]  1       BB02                  0.04   0 [01B..020)-> BB07(0),BB04(1)         ( cond )                     IBC bwd
BB04 [0003]  1       BB03                  0.04   0 [020..022)-> BB09(1)                 (always)                     IBC bwd
BB05 [0004]  1       BB02                  0.04   0 [022..031)-> BB10(0),BB06(1)         ( cond )                     IBC bwd
BB06 [0005]  1       BB05                  0.04   0 [031..048)-> BB10(1)                 (always)                     IBC bwd
BB07 [0006]  1       BB03                  0      0 [048..058)-> BB10(1)                 (always)                     IBC rare bwd
BB08 [0007]  1       BB01                100.00 100 [058..059)                           (return)                     IBC
BB09 [0008]  1       BB04                  0.04   0 [059..06A)-> BB10(1)                 (always)                     IBC bwd
BB10 [0009]  4       BB05,BB06,BB07,BB09   0.08   0 [06A..079)-> BB01(1)                 (always)                     IBC bwd bwd-src
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Whereas if we derive the entry weight from the loop's cyclic probability:

For loop at BB01 cyclic weight is 0.0007662835 cyclic probability is 1.000767
Synthesis: entry BB01 has input weight 0.9992337
Synthesis: exception weight 1e-05
Computing block weights
Synthesis solver: flow graph has 0 improper loop headers
converged at iteration 0 rel residual inf eigenvalue 0
Checking Profile Weights (flags:0x24)
  Method entry BB08 is loop head, can't check entry/exit balance
Profile is self-consistent (10 profiled blocks, 0 unprofiled)

*************** Finishing PHASE Profile incorporation
Trees after Profile incorporation

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  2       BB10                  1      1 [000..016)-> BB08(0.999),BB02(0.000766)    ( cond )                     IBC bwd bwd-target
BB02 [0001]  1       BB01                  0.00   0 [016..01B)-> BB05(0.5),BB03(0.5)     ( cond )                     IBC bwd
BB03 [0002]  1       BB02                  0.00   0 [01B..020)-> BB07(0),BB04(1)         ( cond )                     IBC bwd
BB04 [0003]  1       BB03                  0.00   0 [020..022)-> BB09(1)                 (always)                     IBC bwd
BB05 [0004]  1       BB02                  0.00   0 [022..031)-> BB10(0),BB06(1)         ( cond )                     IBC bwd
BB06 [0005]  1       BB05                  0.00   0 [031..048)-> BB10(1)                 (always)                     IBC bwd
BB07 [0006]  1       BB03                  0      0 [048..058)-> BB10(1)                 (always)                     IBC rare bwd
BB08 [0007]  1       BB01                  1.00   1 [058..059)                           (return)                     IBC
BB09 [0008]  1       BB04                  0.00   0 [059..06A)-> BB10(1)                 (always)                     IBC bwd
BB10 [0009]  4       BB05,BB06,BB07,BB09   0.00   0 [06A..079)-> BB01(1)                 (always)                     IBC bwd bwd-src
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

I had #110693 where I tried to compute fgCalledCount during profile incorporation, but IIRC I ended up with some massive diffs that I did not look into.

I'm tempted to revive this, since we'd ideally compute this early when the profile is still consistent: I suspect some of diffs you got on that PR had to do with OSR methods having nonsensical weights on fgFirstBB.

@jakobbotsch
Copy link
Member

I'm tempted to revive this, since we'd ideally compute this early when the profile is still consistent: I suspect some of diffs you got on that PR had to do with OSR methods having nonsensical weights on fgFirstBB.

Sounds right. IMO it would be best to go that route since otherwise we may just end up churning things twice... fgCalledCount definitely seems to be the thing we want to use here.

@amanasifkhalid
Copy link
Member Author

fgCalledCount definitely seems to be the thing we want to use here.

I think we still run into an ordering problem where we don't know how much flow makes it back into the entry block until profile synthesis has run, but perhaps we can make profile synthesis responsible for updating/setting fgCalledCount (though I wouldn't expect subsequent runs of profile synthesis to change the initial answer)?

@AndyAyersMS
Copy link
Member

If you have a $C_p$ and the entry block originally had count $X$, then the called count is $X/C_p$. If you make that the input weight then later on when the code scales up the entry block it will set the count to $(X/C_p) * C_p = X$ so you end up back where you started.

If the entry block has backedges I believe we'll always find a loop there, since there is no other possible entry (ignoring OSR for the time being).

@amanasifkhalid
Copy link
Member Author

If you have a $C_p$ and the entry block originally had count $X$, then the called count is $X/C_p$. If you make that the input weight then later on when the code scales up the entry block it will set the count to $(X/C_p) * C_p = X$ so you end up back where you started.

So in the case where the entry block is also a loop header, we have to cache the block's original weight before running ComputeCyclicProbabilities so we can get the correct call count in AssignInputWeights. I don't love how the API shape looks, but I updated my approach to cache the profile-derived entry block count, and use it to compute the call count.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants