forked from smarr/PySOM
Inlining #to:do: #35
Open: smarr wants to merge 16 commits into SOM-st:master from smarr:down-to-do-prim
smarr force-pushed the down-to-do-prim branch 2 times, most recently from dcdefbc to 8bb081f (August 6, 2021 15:12)
smarr force-pushed the down-to-do-prim branch from 3e1dc4f to a0055e5 (February 8, 2023 16:56)
smarr changed the title from "Inlining #and:/#&&/#or:/#||/#to:do: and primitive for #downTo:do:" to "Inlining #to:do:" (Feb 8, 2023)
…uring inlining
Block arguments are turned into locals during inlining. This might not be perfect for the general case, but works well for to:do:. This commit also adapts the variable write nodes so that I can pass in the value directly, and don't have to have a child node. This requires some adaptation in the node traversal methods in abstract_node.py to handle the case that child nodes may be legitimately None. This also moves the lexical scope into the abstract method. The drawback is that trivial methods also have the extra field. Signed-off-by: Stefan Marr <[email protected]>
There's a failing assertion on logging with PYPYLOG, which is hopefully fixed with this. Signed-off-by: Stefan Marr <[email protected]>
Signed-off-by: Stefan Marr <[email protected]>
The biggest drawback of this design is that the lifetime of the frame is longer than the block activation in the loop, which is a problem for performance, because the compiler materializes the store. Using `if we_are_jitted()` prevents the unnecessary store in the trace, and at the same time, avoids much of the overhead in the interpreter. Signed-off-by: Stefan Marr <[email protected]>
This requires two new bytecodes: DUP_SECOND (in addition to DUP, which duplicates the first/top element of the stack) and JUMP_IF_GREATER. Signed-off-by: Stefan Marr <[email protected]>
Signed-off-by: Stefan Marr <[email protected]>
Signed-off-by: Stefan Marr <[email protected]>
Signed-off-by: Stefan Marr <[email protected]>
smarr force-pushed the down-to-do-prim branch from a0055e5 to 2e5e16d (February 8, 2023 17:36)
…ining for them Signed-off-by: Stefan Marr <[email protected]>
… to handle double receivers Signed-off-by: Stefan Marr <[email protected]>
…other trivial methods needs similar fixes Signed-off-by: Stefan Marr <[email protected]>
Signed-off-by: Stefan Marr <[email protected]>
Signed-off-by: Stefan Marr <[email protected]>
Signed-off-by: Stefan Marr <[email protected]>
Signed-off-by: Stefan Marr <[email protected]>
Signed-off-by: Stefan Marr <[email protected]>
smarr added a commit to SOM-st/SOMpp that referenced this pull request (Aug 2, 2024)
This is based on SOM-st/PySOM#35 which was never merged because of the impact on the JIT compiler. Though, since we have a pure interpreter here, it's a good win on benchmarks that use a `#to:do:` loop. The median run time is reduced by only 2% though. https://rebench.dev/SOMpp/compare/692cc6c47727a7fa76923daf92e0bb8c7d4be3b1..252923132894fff18b7d1364dfb2981f439f146c
This PR inlines `#to:do:` and its block.

For the interpreters, this is generally a good win.
The SomSom benchmarks see run-time reductions of 5-8% for the AST interpreter, and 7-9% for the bytecode interpreter.
For the micro and macro benchmarks on the interpreters, the run-time reductions reach up to 37% on the Sum microbenchmark for the AST interpreter, and 30% on the Loop microbenchmark for the BC interpreter. Overall, though, the BC interpreter seems to benefit more.
However, steady-state JIT performance is a different picture. DeltaBlue's run time increases by 9% and NBody's by 24% after this change. For the AST interpreter, this already includes mitigations in the `ToDoInlined` node, which `nil`s out the loop index variable so that it can be optimized better by the compiler. The key issue here is that the lifetime of the loop index variable is much less clearly defined than before, and the compiler struggles to optimize it.

For the bytecode interpreter, things are worse, and at this point, I do not yet have a solution.
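To make the AST-interpreter mitigation mentioned above more concrete, here is a minimal, self-contained sketch of an inlined `#to:do:` node. It is not the code from this PR: the class names (`Frame`, `Literal`, `ReadLocal`, `AddToLocal`, `ToDoInlinedSketch`), the `NIL` stand-in, and the choice to clear the index once after the loop are all assumptions made for illustration. The point is only that the loop index becomes a frame slot written by the loop itself, instead of an argument of a separately activated block.

```python
# Hypothetical sketch, not the PR's ToDoInlined implementation.
NIL = object()  # stand-in for SOM's nil object (assumption)


class Frame(object):
    def __init__(self, num_locals):
        self.locals = [NIL] * num_locals


class Literal(object):
    def __init__(self, value):
        self.value = value

    def execute(self, frame):
        return self.value


class ReadLocal(object):
    def __init__(self, slot):
        self.slot = slot

    def execute(self, frame):
        return frame.locals[self.slot]


class AddToLocal(object):
    """Models `target := target + operand` for the loop body."""

    def __init__(self, target_slot, operand):
        self.target_slot = target_slot
        self.operand = operand

    def execute(self, frame):
        frame.locals[self.target_slot] += self.operand.execute(frame)
        return frame.locals[self.target_slot]


class ToDoInlinedSketch(object):
    """Runs the former block body directly; the block argument is a local."""

    def __init__(self, from_expr, to_expr, body_expr, idx_slot):
        self.from_expr = from_expr
        self.to_expr = to_expr
        self.body_expr = body_expr
        self.idx_slot = idx_slot

    def execute(self, frame):
        start = self.from_expr.execute(frame)
        end = self.to_expr.execute(frame)
        i = start
        while i <= end:
            # The index is stored in the enclosing method's frame,
            # so it outlives each iteration of the body.
            frame.locals[self.idx_slot] = i
            self.body_expr.execute(frame)
            i += 1
        # Assumed mitigation: clear the slot so the index is visibly dead.
        frame.locals[self.idx_slot] = NIL
        return start  # the receiver is the result of the expression


# Models `1 to: 10 do: [:i | sum := sum + i ]` with sum in slot 0, i in slot 1.
frame = Frame(2)
frame.locals[0] = 0
node = ToDoInlinedSketch(Literal(1), Literal(10),
                         AddToLocal(0, ReadLocal(1)), idx_slot=1)
result = node.execute(frame)
```

Without the final store, the last index value would stay reachable through the frame for as long as the enclosing method runs, which is the extended lifetime described above.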
The issue is illustrated with the Dispatch microbenchmark. Trace: https://gist.github.com/smarr/08b234b741b9ffd2b77cee49836d1332
Because of the inlining, the loop index is now both on the stack and in the frame.
This prolongs its lifetime, and because the compiler does not optimize well across guards, it keeps a lot of redundant operations in the trace.
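The duplication of the loop index can be illustrated with a toy stack machine. This is a sketch under assumptions, not the PR's bytecode set: only the names `DUP`, `DUP_SECOND`, and `JUMP_IF_GREATER` come from the commit messages, while the other opcodes, the operand encoding, the absolute jump targets, and the exact semantics of `JUMP_IF_GREATER` (pop both values when jumping) are invented here for illustration.

```python
# Hypothetical toy bytecode set; encoding and semantics are assumptions.
(PUSH_CONST, PUSH_LOCAL, POP_LOCAL, DUP, DUP_SECOND, ADD, INC,
 JUMP_IF_GREATER, JUMP_BACKWARD, HALT) = range(10)


def run(code, num_locals):
    stack = []
    frame = [0] * num_locals
    pc = 0
    while True:
        op = code[pc]
        if op == PUSH_CONST:
            stack.append(code[pc + 1])
            pc += 2
        elif op == PUSH_LOCAL:
            stack.append(frame[code[pc + 1]])
            pc += 2
        elif op == POP_LOCAL:
            frame[code[pc + 1]] = stack.pop()
            pc += 2
        elif op == DUP:
            stack.append(stack[-1])   # duplicate the top element
            pc += 1
        elif op == DUP_SECOND:
            stack.append(stack[-2])   # duplicate the element below the top
            pc += 1
        elif op == ADD:
            right = stack.pop()
            stack[-1] += right
            pc += 1
        elif op == INC:
            stack[-1] += 1
            pc += 1
        elif op == JUMP_IF_GREATER:
            if stack[-1] > stack[-2]:  # index > limit: leave the loop
                stack.pop()
                stack.pop()
                pc = code[pc + 1]
            else:
                pc += 2
        elif op == JUMP_BACKWARD:
            pc = code[pc + 1]
        elif op == HALT:
            return stack, frame


# Models `1 to: 10 do: [:i | sum := sum + i ]`; local 0 is sum, local 1 is i.
program = [
    PUSH_CONST, 1,          # 0: receiver, also the start value
    PUSH_CONST, 10,         # 2: limit
    DUP_SECOND,             # 4: stack is [start, limit, i]
    JUMP_IF_GREATER, 20,    # 5: loop head
    DUP,                    # 7: copy i ...
    POP_LOCAL, 1,           # 8: ... into the frame for the body to read
    PUSH_LOCAL, 0,          # 10: sum
    PUSH_LOCAL, 1,          # 12: i
    ADD,                    # 14
    POP_LOCAL, 0,           # 15: sum := sum + i
    INC,                    # 17: advance the stack copy of i
    JUMP_BACKWARD, 5,       # 18
    HALT,                   # 20
]
final_stack, final_frame = run(program, 2)
```

In this encoding the stack copy of the index drives the loop while the frame copy serves the body, so every iteration carries both. The frame copy still holds the last index after the loop has finished.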
At this point, PageRank's run time increases by 100%, NBody's by 27%, Dispatch's by 143%, and Loop's by 89%.
Thus, it comes at a significant cost. There are also some run time reductions, but the overheads/increases seem to be severe.
https://rebench.stefan-marr.de/compare/RPySOM/3eb75e51b4d02c327f03dd5f01f2cf3f77bec220/13435b759281863f6a3458e006f3e728fc51593f#macro-steady-RPySOM-ast-jit