Skip to content

[Branch Hints] Add a utility to compare with metadata, and use it in merging opt passes #7733

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 11 additions & 6 deletions src/ir/metadata.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -105,20 +105,25 @@ bool equal(Function* a, Function* b) {
return false;
}

return equal(a->body, b->body, a, b);
}

bool equal(Expression* a, Expression* b, Function* aFunc, Function* bFunc) {
assert(aFunc && bFunc);

// TODO: We do not consider debug locations here. This is often what is
// desired in optimized builds (e.g. if we are trying to fold two
// pieces of code together, that benefit outweighs slightly inaccurate
// debug info). If we find that non-optimizer locations call this in
// ways that lead to degraded debug info, we could add an option to
// control it.

if (a->codeAnnotations.empty() && b->codeAnnotations.empty()) {
if (aFunc->codeAnnotations.empty() && bFunc->codeAnnotations.empty()) {
// Nothing to compare; no differences.
return true;
}

Serializer aList(a->body);
Serializer bList(b->body);
Serializer aList(a);
Serializer bList(b);

if (aList.list.size() != bList.list.size()) {
return false;
Expand All @@ -128,8 +133,8 @@ bool equal(Function* a, Function* b) {
for (Index i = 0; i < aList.list.size(); i++) {
if (!compare(aList.list[i],
bList.list[i],
a->codeAnnotations,
b->codeAnnotations,
aFunc->codeAnnotations,
bFunc->codeAnnotations,
Function::CodeAnnotation())) {
return false;
}
Expand Down
4 changes: 4 additions & 0 deletions src/ir/metadata.h
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,10 @@ void copyBetweenFunctions(Expression* origin,
// after all else is known equal).
bool equal(Function* a, Function* b);

// Check if two expressions are equal in metadata. They may or may not be from
// the same function.
bool equal(Expression* a, Expression* b, Function* aFunc, Function* bFunc);

} // namespace wasm::metadata

#endif // wasm_ir_metadata_h
16 changes: 16 additions & 0 deletions src/ir/utils.h
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
#define wasm_ir_utils_h

#include "ir/branch-utils.h"
#include "ir/metadata.h"
#include "pass.h"
#include "wasm-builder.h"
#include "wasm-traversal.h"
Expand Down Expand Up @@ -69,6 +70,21 @@ struct ExpressionAnalyzer {
return flexibleEqual(left, right, comparer);
}

// Compare two expressions and their metadata as well. If just the first
// function is provided, we consider them both to arrive from the same one.
static bool equalIncludingMetadata(Expression* left,
Expression* right,
Function* leftFunc = nullptr,
Function* rightFunc = nullptr) {
if (!equal(left, right)) {
return false;
}
if (!rightFunc) {
rightFunc = leftFunc;
}
return metadata::equal(left, right, leftFunc, rightFunc);
}

// A shallow comparison, ignoring child nodes.
static bool shallowEqual(Expression* left, Expression* right) {
auto comparer = [left, right](Expression* currLeft, Expression* currRight) {
Expand Down
15 changes: 9 additions & 6 deletions src/passes/CodeFolding.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -249,8 +249,8 @@ struct CodeFolding
// run the rest of the optimization mormally.
auto maybeAddBlock = [this](Block* block, Expression*& other) -> Block* {
// If other is a suffix of the block, wrap it in a block.
if (block->list.empty() ||
!ExpressionAnalyzer::equal(other, block->list.back())) {
if (block->list.empty() || !ExpressionAnalyzer::equalIncludingMetadata(
other, block->list.back(), getFunction())) {
return nullptr;
}
// Do it, assign to the out param `other`, and return the block.
Expand Down Expand Up @@ -395,7 +395,8 @@ struct CodeFolding
Index tail = 1;
for (; tail < tails.size(); ++tail) {
auto* other = getMergeable(tails[tail], num);
if (!other || !ExpressionAnalyzer::equal(item, other)) {
if (!other || !ExpressionAnalyzer::equalIncludingMetadata(
item, other, getFunction())) {
// Other tail too short or has a difference.
break;
}
Expand Down Expand Up @@ -673,7 +674,8 @@ struct CodeFolding
[&](Expression* item) {
if (item ==
first || // don't bother comparing the first
ExpressionAnalyzer::equal(item, first)) {
ExpressionAnalyzer::equalIncludingMetadata(
item, first, getFunction())) {
// equal, keep it
return false;
} else {
Expand All @@ -691,8 +693,9 @@ struct CodeFolding
explore.end(),
[&](Tail& tail) {
auto* item = getItem(tail, num);
return !ExpressionAnalyzer::equal(
item, correct);
return !ExpressionAnalyzer::
equalIncludingMetadata(
item, correct, getFunction());
}),
explore.end());
// try to optimize this deeper tail. if we succeed, then stop here,
Expand Down
5 changes: 5 additions & 0 deletions src/passes/LocalCSE.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -157,6 +157,11 @@ struct HEComparer {
if (a.digest != b.digest) {
return false;
}
// Note that we do not consider metadata here. That means we may replace two
// identical expressions with different metadata, say, different branch
// hints, but that is ok: we are only removing things from executing (by
// reusing the first computed value), so this will not cause new invalid
// branch hints to execute.
return ExpressionAnalyzer::equal(a.expr, b.expr);
}
};
Expand Down
12 changes: 9 additions & 3 deletions src/passes/OptimizeInstructions.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1168,16 +1168,18 @@ struct OptimizeInstructions
void visitIf(If* curr) {
curr->condition = optimizeBoolean(curr->condition);
if (curr->ifFalse) {
auto* func = getFunction();
if (auto* unary = curr->condition->dynCast<Unary>()) {
if (unary->op == EqZInt32) {
// flip if-else arms to get rid of an eqz
curr->condition = unary->value;
std::swap(curr->ifTrue, curr->ifFalse);
BranchHints::flip(curr, getFunction());
BranchHints::flip(curr, func);
}
}
if (curr->condition->type != Type::unreachable &&
ExpressionAnalyzer::equal(curr->ifTrue, curr->ifFalse)) {
ExpressionAnalyzer::equalIncludingMetadata(
curr->ifTrue, curr->ifFalse, func)) {
// The sides are identical, so fold. If we can replace the If with one
// arm and there are no side effects in the condition, replace it. But
// make sure not to change a concrete expression to an unreachable
Expand Down Expand Up @@ -3235,8 +3237,12 @@ struct OptimizeInstructions
}
}
{
// Sides are identical, fold
// If sides are identical, fold.
Expression *ifTrue, *ifFalse, *c;
// Note we do not compare metadata here: This is a select, so both arms
// execute anyhow, and things like branch hints were already being run.
// After optimization, we will only run fewer things, and run no risk of
// running new bad things.
if (matches(curr, select(any(&ifTrue), any(&ifFalse), any(&c))) &&
ExpressionAnalyzer::equal(ifTrue, ifFalse)) {
auto value = effects(ifTrue);
Expand Down
107 changes: 107 additions & 0 deletions test/lit/passes/code-folding_branch-hints.wast
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
;; NOTE: Assertions have been generated by update_lit_checks.py --all-items and should not be edited.
;; NOTE: This test was ported using port_passes_tests_to_lit.py and could be cleaned up.

;; RUN: wasm-opt %s -all --code-folding -S -o - | filecheck %s

(module
;; CHECK: (type $0 (func (param i32 i32) (result f32)))

;; CHECK: (func $different (type $0) (param $x i32) (param $y i32) (result f32)
;; CHECK-NEXT: (if
;; CHECK-NEXT: (local.get $x)
;; CHECK-NEXT: (then
;; CHECK-NEXT: (@metadata.code.branch_hint "\00")
;; CHECK-NEXT: (if
;; CHECK-NEXT: (local.get $y)
;; CHECK-NEXT: (then
;; CHECK-NEXT: (nop)
;; CHECK-NEXT: )
;; CHECK-NEXT: )
;; CHECK-NEXT: )
;; CHECK-NEXT: (else
;; CHECK-NEXT: (@metadata.code.branch_hint "\01")
;; CHECK-NEXT: (if
;; CHECK-NEXT: (local.get $y)
;; CHECK-NEXT: (then
;; CHECK-NEXT: (nop)
;; CHECK-NEXT: )
;; CHECK-NEXT: )
;; CHECK-NEXT: )
;; CHECK-NEXT: )
;; CHECK-NEXT: (f32.const 0)
;; CHECK-NEXT: )
(func $different (param $x i32) (param $y i32) (result f32)
;; The branch hints differ, so we do not optimize.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to merge and drop the hint in this case. If the only benefit of branch hints is that cold code can be placed far away, then surely just deduplicating the cold code into some warmer code is at least as good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the issue may be cold code inside each of the arms. Imagine that under some condition, matching$x, the first arm runs 30% faster (because the internal if is almost never entered, depending on $y), and that in the reverse condition, the second arm runs 30% faster (because the internal if is almost always entered). Merging the arms and removing the branch hint would make us 30% slower.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But wouldn't the hypothetical speedups of 30% be achieved via better code cache locality when the cold blocks are split out? Then merging the code would achieve the same speedups.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Maybe the simple testcase here is misleading - imagine there are loops in each of the arms here, so we spend a lot of time in one arm, and the hints matter.

  2. The problem is that which blocks are cold depends on what data we are running on:

In more detail, imagine we have a function calc_mostly_0 which processes some huge buffer of data, and assumes most of the code is 0. We would want a hint there that whenever it checks if an item is 0, that is likely.

And imagine we also have calc_mostly_1 which processes similar data, but assumes most of the data is 1. The hint there would say items of 0 are unlikely, the reverse of before.

Now assume those two functions are identical in all but the branch hints. Then they get compiled to very different machine code: in the first, blocks that handle 1 are cold and moved out, while in the latter, blocks that handle 0 are cold and moved out.

And, imagine we can tell what the data properties are, at runtime: perhaps we know that data from certain sources is mostly 0s, and from others, mostly 1s, so we might do

if (probably_mostly_0) {
  calc_mostly_0();
} else {
  calc_mostly_1();
}

Merging code here, or even just removing the branch hints, would be slower.

This is a bit unlikely to happen in reality - likely there would be other differences somewhere - but it is realistic, I think, to collect separate PGO profiles and compile multiple versions of a function, if you have a way to know which profile is more relevant. Like imagine a physics engine can be tuned for many small objects or a few big and complicated ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, if you merge the functions in that situation, then the merged function has more co-located code than either of the unmerged functions would have. But there would be less code overall, so potentially less code cache pressure and better performance, although that would depend on the placement of the functions and the behavior of the cache. I just don't think there's as clear a benefit to keeping the functions separate as you are arguing, and allowing optimization hints to inhibit optimizations seems like a priority inversion to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is more code overall without merging, but the hot-loop parts would remain hot loops regardless of how much code is around them. We have to assume some effect of locality like that, I think? Otherwise, if I follow your logic, we'd need to consider erasing hints after inlining (as the extra code might render them ineffective), but nothing I have read suggests that.

Also, not merging code here is necessary to preserve the fuzzing invariant that optimizations do not cause bad branch hints: If we merge, we must pick one hint and it might be wrong (causing large slowdowns). Or, if we merge and erase the hints, we may also cause a slowdown.

(I do think maybe when optimizing for size, that we may want to always merge, though.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After talking offline, it sounds like we just had different expectations about how useful the hints would be and therefore how hard we should try to preserve them. I hadn't been considering register allocation effects in particular. This approach sgtm, especially if it turns out that LLVM makes similar decisions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you raised a very good point with following LLVM. I had trouble finding a clear answer in the source, but here is a testcase with a select:

define i32 @simple_hoist(i32 %n, i1 %cond, i1 %cond2) {
entry:
  br i1 %cond, label %then.block, label %else.block

then.block:
  %temp1 = select i1 %cond2, i32 10, i32 -1, !prof !1
  %val_then = add i32 %n, %temp1
  br label %merge.block

else.block:
  %temp2 = select i1 %cond2, i32 10, i32 -1, !prof !2
  %val_else = add i32 %n, %temp2
  br label %merge.block

merge.block:
  %res = phi i32 [ %val_then, %then.block ], [ %val_else, %else.block ]
  ret i32 %res
}

!1 = !{!"branch_weights", i32 2000, i32 1}
!2 = !{!"branch_weights", i32 1, i32 2000}

and here is one with an if:

define i32 @simple_hoist(i32 %n, i1 %cond, i1 %cond2) {
entry:
  br i1 %cond, label %then.block, label %else.block

then.block:
  %val_then = add i32 %n, 20
  %cmp.then = icmp sgt i32 %val_then, 100
  br i1 %cmp.then, label %ret.a, label %ret.b, !prof !1

else.block:
  %val_else = add i32 %n, 20
  %cmp.else = icmp sgt i32 %val_else, 100
  br i1 %cmp.else, label %ret.a, label %ret.b, !prof !2

ret.a:
  %res1 = phi i32 [ 42, %then.block ], [ 1337, %else.block ]
  ret i32 %res1

ret.b:
  %res2 = phi i32 [ 43, %then.block ], [ 1338, %else.block ]
  ret i32 %res2
}

!1 = !{!"branch_weights", i32 2000, i32 1}
!2 = !{!"branch_weights", i32 1, i32 2000}

In both cases LLVM is perfectly happy to pick one of the two arbitrarily. It makes no effort to preserve the hint.

I am surprised by this but I think we can trust LLVM on it, so let's not land this, and I'll also revert #7715 . I will also need to figure out some other mechanism for how to fuzz this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the plan to drop the hint on merging? If so, I wouldn't expect that to need special fuzzer support. Or is the plan to pick one arbitrarily and keep its hint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm considering a few possible plans for the fuzzer, either skipping relevant passes, or a flag.

Yes, I think picking one hint arbitrarily, as LLVM does, is good enough - it's simplest and presumably LLVM found it ok to do (maybe even optimal).

(if (result f32)
(local.get $x)
(then
(block (result f32)
(@metadata.code.branch_hint "\00")
(if
(local.get $y)
(then
(nop)
)
)
(f32.const 0)
)
)
(else
(block (result f32)
(@metadata.code.branch_hint "\01")
(if
(local.get $y)
(then
(nop)
)
)
(f32.const 0)
)
)
)
)

;; CHECK: (func $same (type $0) (param $x i32) (param $y i32) (result f32)
;; CHECK-NEXT: (drop
;; CHECK-NEXT: (local.get $x)
;; CHECK-NEXT: )
;; CHECK-NEXT: (@metadata.code.branch_hint "\00")
;; CHECK-NEXT: (if
;; CHECK-NEXT: (local.get $y)
;; CHECK-NEXT: (then
;; CHECK-NEXT: (nop)
;; CHECK-NEXT: )
;; CHECK-NEXT: )
;; CHECK-NEXT: (f32.const 0)
;; CHECK-NEXT: )
(func $same (param $x i32) (param $y i32) (result f32)
;; The branch hints are the same, so we optimize.
(if (result f32)
(local.get $x)
(then
(block (result f32)
(@metadata.code.branch_hint "\00")
(if
(local.get $y)
(then
(nop)
)
)
(f32.const 0)
)
)
(else
(block (result f32)
(@metadata.code.branch_hint "\00")
(if
(local.get $y)
(then
(nop)
)
)
(f32.const 0)
)
)
)
)
)
Loading
Loading