Inlining #to:do: #35

Open: wants to merge 16 commits into master

@smarr (Member) commented Aug 5, 2021

This PR inlines #to:do: and its block.

For the interpreters, this is generally a good win.
The SomSom benchmarks see run-time reductions of 5-8% for the AST interpreter, and 7-9% for the bytecode interpreter.

On the micro and macro benchmarks, the interpreters gain up to 37% on the Sum microbenchmark (AST interpreter) and 30% on the Loop microbenchmark (BC interpreter). Overall, though, the BC interpreter seems to benefit more.

However, steady-state JIT performance paints a different picture. After this change, DeltaBlue's run time increases by 9% and NBody's by 24%. For the AST interpreter, these numbers already include mitigations in the ToDoInlined node, which nils out the loop index variable so that the compiler can optimize it better. The key issue is that the lifetime of the loop index variable is much less clearly defined than before, and the compiler struggles to optimize it.
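To make the mitigation more concrete, here is a minimal sketch of what an inlined `#to:do:` node could look like in a plain-Python AST interpreter. It is not PySOM's actual code: apart from the name `ToDoInlined` and RPython's `we_are_jitted()`, everything here (the `Frame` class, slot indices, helper nodes) is hypothetical. The block's `:i` parameter becomes an ordinary frame slot, and the slot is cleared after the loop. Guarding that clearing store with `we_are_jitted()` follows one plausible reading of a later commit message in this PR; when the code is not running as a translated JIT trace, `we_are_jitted()` returns `False`, so the index keeps its last value.

```python
try:
    from rpython.rlib.jit import we_are_jitted
except ImportError:
    # Fallback when RPython is not installed: untranslated code never runs jitted.
    def we_are_jitted() -> bool:
        return False


class Frame:
    """Hypothetical frame: one slot per local variable."""

    def __init__(self, num_locals: int) -> None:
        self.slots = [None] * num_locals


class Literal:
    def __init__(self, value):
        self.value = value

    def execute(self, frame: Frame):
        return self.value


class ReadLocal:
    def __init__(self, slot: int):
        self.slot = slot

    def execute(self, frame: Frame):
        return frame.slots[self.slot]


class WriteLocal:
    def __init__(self, slot: int, value_expr):
        self.slot = slot
        self.value_expr = value_expr

    def execute(self, frame: Frame):
        value = self.value_expr.execute(frame)
        frame.slots[self.slot] = value
        return value


class Add:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def execute(self, frame: Frame):
        return self.left.execute(frame) + self.right.execute(frame)


class ToDoInlined:
    """Inlined `start to: limit do: [:i | body]`; the body runs in the outer frame."""

    def __init__(self, from_expr, to_expr, body, index_slot: int):
        self.from_expr = from_expr
        self.to_expr = to_expr
        self.body = body
        self.index_slot = index_slot  # the former block argument, now a local

    def execute(self, frame: Frame):
        start = self.from_expr.execute(frame)
        limit = self.to_expr.execute(frame)
        i = start
        while i <= limit:
            frame.slots[self.index_slot] = i  # no block activation, no new frame
            self.body.execute(frame)
            i += 1
        if we_are_jitted():
            # Hypothetical mitigation: nil out the index so its lifetime visibly
            # ends with the loop and the last value need not be kept alive.
            frame.slots[self.index_slot] = None
        return start  # assumption: the loop answers its receiver


# sum := 0. 1 to: 10 do: [:i | sum := sum + i]
SUM, I = 0, 1  # hypothetical slot layout
frame = Frame(2)
frame.slots[SUM] = 0
loop = ToDoInlined(Literal(1), Literal(10),
                   WriteLocal(SUM, Add(ReadLocal(SUM), ReadLocal(I))), I)
loop.execute(frame)
print(frame.slots)  # [55, 10] when not running under the JIT
```

Outside the JIT, the index slot still holds `10` after the loop, which illustrates why the variable's lifetime is harder for the compiler to reason about once the block is inlined.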

For the bytecode interpreter, things are worse, and at this point I do not yet have a solution.

The issue is illustrated with the Dispatch microbenchmark. Trace: https://gist.github.com/smarr/08b234b741b9ffd2b77cee49836d1332

Because of the inlining, the loop index now lives both on the stack and in the frame.
This prolongs its lifetime, and because the compiler does not optimize well around guards, it keeps many redundant operations in the trace.

At this point, PageRank's run time increases by 100%, NBody's by 27%, Dispatch's by 143%, and Loop's by 89%.
Thus, the change comes at a significant cost. There are also some run-time reductions, but the increases seem severe.

https://rebench.stefan-marr.de/compare/RPySOM/3eb75e51b4d02c327f03dd5f01f2cf3f77bec220/13435b759281863f6a3458e006f3e728fc51593f#macro-steady-RPySOM-ast-jit

@smarr force-pushed the down-to-do-prim branch 2 times, most recently from dcdefbc to 8bb081f on August 6, 2021 15:12
@smarr changed the title from "Inlining #and:/#&&/#or:/#||/#to:do: and primitive for #downTo:do:" to "Inlining #to:do:" on Feb 8, 2023
smarr added 8 commits February 8, 2023 17:36
…uring inlining

Block arguments are turned into locals during inlining.
This might not be perfect for the general case, but works well for to:do:.

This commit also adapts the variable write nodes so that I can pass in the value directly instead of needing a child node.
This requires some adaptation in the node traversal methods in abstract_node.py to handle child nodes that may legitimately be None.

This also moves the lexical scope into the abstract method.
The drawback is that trivial methods also have the extra field.

Signed-off-by: Stefan Marr <[email protected]>
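The write-node change in the commit message above can be sketched as follows. This is a hypothetical illustration, not the code in `abstract_node.py`: the `_child_fields` tuple, `children()`, `write_value()`, `count_nodes()`, and the list-based frame are all invented here. A write node may be created without a child expression so that an inliner can hand it a value directly (for example, a loop index), which is why tree traversal has to skip `None` children.

```python
from typing import Iterator, List, Optional


class Node:
    # Hypothetical: names of attributes that hold child nodes.
    _child_fields: tuple = ()

    def children(self) -> Iterator["Node"]:
        for name in self._child_fields:
            child = getattr(self, name)
            if child is not None:  # a child may now legitimately be absent
                yield child

    def execute(self, frame: List[object]) -> object:
        raise NotImplementedError


class Literal(Node):
    def __init__(self, value: object) -> None:
        self.value = value

    def execute(self, frame: List[object]) -> object:
        return self.value


class WriteLocal(Node):
    _child_fields = ("value_expr",)

    def __init__(self, slot: int, value_expr: Optional[Node] = None) -> None:
        self.slot = slot
        self.value_expr = value_expr  # None when the caller supplies the value

    def execute(self, frame: List[object]) -> object:
        # Regular path: evaluate the child expression, then store its result.
        assert self.value_expr is not None
        return self.write_value(frame, self.value_expr.execute(frame))

    def write_value(self, frame: List[object], value: object) -> object:
        # Direct path, e.g. for an inlined loop storing its index.
        frame[self.slot] = value
        return value


def count_nodes(node: Node) -> int:
    """Walk the tree; relies on children() skipping None entries."""
    return 1 + sum(count_nodes(child) for child in node.children())


frame: List[object] = [None, None]
with_child = WriteLocal(0, Literal(42))
without_child = WriteLocal(1)
with_child.execute(frame)
without_child.write_value(frame, 7)
print(frame, count_nodes(with_child), count_nodes(without_child))  # [42, 7] 2 1
```

Filtering `None` inside `children()` keeps every traversal (counting, printing, replacing nodes) correct without each caller having to special-case child-less write nodes.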
There's a failing assertion when logging with PYPYLOG, which this hopefully fixes.

Signed-off-by: Stefan Marr <[email protected]>
The biggest drawback of this design is that the frame lives longer than the block activation in the loop. This is a problem for performance, because the compiler materializes the store.

Using `if we_are_jitted()` prevents the unnecessary store in the trace while avoiding much of the overhead in the interpreter.

Signed-off-by: Stefan Marr <[email protected]>
This requires two new bytecodes: DUP_SECOND (in addition to DUP, which duplicates the top element of the stack) and JUMP_IF_GREATER.

Signed-off-by: Stefan Marr <[email protected]>
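One way the two new bytecodes could cooperate is shown in the toy stack machine below. It is a hypothetical sketch, not PySOM's bytecode set or compiler. Only the names `DUP`, `DUP_SECOND`, and `JUMP_IF_GREATER` come from the commit message; the `[limit, index]` stack layout and the exact semantics (in particular that `JUMP_IF_GREATER` pops the copied limit and then peeks at the index) are assumptions. The loop keeps the index on the operand stack and also stores a copy into the local `i`, mirroring the situation described in the PR where the index is both on the stack and in the frame.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical opcodes; only DUP, DUP_SECOND and JUMP_IF_GREATER are named in the PR.
PUSH_CONST, PUSH_LOCAL, POP_LOCAL = "PUSH_CONST", "PUSH_LOCAL", "POP_LOCAL"
DUP, DUP_SECOND, JUMP_IF_GREATER = "DUP", "DUP_SECOND", "JUMP_IF_GREATER"
ADD, INC, JUMP, POP = "ADD", "INC", "JUMP", "POP"


def run(code: List[Tuple], local_names: List[str]) -> Dict[str, Any]:
    """Execute the toy stack machine and return its local variables."""
    stack: List[Any] = []
    local_vars: Dict[str, Any] = {name: None for name in local_names}
    pc = 0
    while pc < len(code):
        op, *args = code[pc]
        pc += 1
        if op == PUSH_CONST:
            stack.append(args[0])
        elif op == PUSH_LOCAL:
            stack.append(local_vars[args[0]])
        elif op == POP_LOCAL:
            local_vars[args[0]] = stack.pop()
        elif op == DUP:  # duplicate the top element
            stack.append(stack[-1])
        elif op == DUP_SECOND:  # duplicate the element just below the top
            stack.append(stack[-2])
        elif op == JUMP_IF_GREATER:
            # Assumed semantics: pop the copied limit, compare the index now on
            # top (without popping it), and jump out once index > limit.
            limit = stack.pop()
            if stack[-1] > limit:
                pc = args[0]
        elif op == ADD:
            right = stack.pop()
            stack.append(stack.pop() + right)
        elif op == INC:
            stack.append(stack.pop() + 1)
        elif op == JUMP:
            pc = args[0]
        elif op == POP:
            stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")
    assert not stack, "stack should be balanced at the end"
    return local_vars


# sum := 0. 1 to: 10 do: [:i | sum := sum + i]
# Assumed stack layout inside the loop: [limit, index], index on top.
code = [
    (PUSH_CONST, 0),
    (POP_LOCAL, "sum"),     # sum := 0
    (PUSH_CONST, 10),       # limit
    (PUSH_CONST, 1),        # initial index
    (DUP_SECOND,),          # index 4, loop head: [limit, index, limit]
    (JUMP_IF_GREATER, 14),  # leave the loop once index > limit
    (DUP,),                 # [limit, index, index]
    (POP_LOCAL, "i"),       # the block argument i is now a plain local
    (PUSH_LOCAL, "sum"),
    (PUSH_LOCAL, "i"),
    (ADD,),
    (POP_LOCAL, "sum"),     # sum := sum + i
    (INC,),                 # index := index + 1, still on the stack
    (JUMP, 4),              # back to the loop head
    (POP,),                 # index 14: drop the index
    (POP,),                 # drop the limit
]

result = run(code, ["sum", "i"])
print(result)  # {'sum': 55, 'i': 10}
```

After the loop, `i` still holds `10`: the copy in the frame outlives the loop, which is the prolonged lifetime the PR description identifies as the source of redundant operations in the trace.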
Signed-off-by: Stefan Marr <[email protected]>
Signed-off-by: Stefan Marr <[email protected]>
smarr added a commit to SOM-st/SOMpp that referenced this pull request Aug 2, 2024
This is based on SOM-st/PySOM#35 which was never
merged because of the impact on the JIT compiler.
However, since we have a pure interpreter here, it is a good win on
benchmarks that use a `#to:do:` loop.

The median run time is reduced by only 2% though.


https://rebench.dev/SOMpp/compare/692cc6c47727a7fa76923daf92e0bb8c7d4be3b1..252923132894fff18b7d1364dfb2981f439f146c