Deep match amb child memoization #2310


Open · wants to merge 6 commits into base: feat/error-tree-support

Conversation

PieterOlivier (Contributor)

This PR implements amb child node memoization for deep (descendant) matches.

This is needed because trees with error nodes often have nested ambiguity nodes containing many identical subtrees. By using memoization during deep matches, identical nodes are only visited once.

    // There will only be 3 matches if deep matches are memoized, 6 if they are not.
    return count == 3;
}
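
For context, a minimal self-contained sketch of the idea follows. It is illustrative only: the Node interface and the method and field names are invented for this example and are not taken from the PR. During a deep match, the alternatives of an ambiguity cluster are only descended into if they have not been visited before, so shared subtrees are matched once.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Sketch: a deep (descendant) traversal that visits each alternative of an
// ambiguity cluster at most once, even when identical subtrees reappear in
// nested clusters.
class DeepMatchSketch {
    interface Node {
        boolean isAmb();
        Iterable<Node> alternatives(); // alternatives of an amb cluster
        Iterable<Node> children();     // children of a regular node
    }

    static int countMatches(Node root, Predicate<Node> pattern) {
        Set<Node> visitedAmbChildren = new HashSet<>(); // memo table for this traversal
        Deque<Node> spine = new ArrayDeque<>();
        spine.push(root);
        int count = 0;
        while (!spine.isEmpty()) {
            Node node = spine.pop();
            if (pattern.test(node)) {
                count++;
            }
            if (node.isAmb()) {
                for (Node alt : node.alternatives()) {
                    // add() returns false if the alternative was seen before,
                    // in which case the whole subtree is skipped.
                    if (visitedAmbChildren.add(alt)) {
                        spine.push(alt);
                    }
                }
            } else {
                for (Node child : node.children()) {
                    spine.push(child);
                }
            }
        }
        return count;
    }
}

Skipping already-visited alternatives is what the test above checks for: 3 matches with memoization instead of 6 without.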

Member

Please add a test where we match on the amb node itself (0 | it + 1 | /amb(alts) <- ambTree), because it's a corner case.

private void pushAmb(ITree amb) {
    if (debug) System.err.println("pushAmb: " + amb);
    spine.push(amb);
    for (IValue alt : TreeAdapter.getAlternatives(amb)) {
Member

Why not hash on the entire amb cluster? Then there would be 50% (or more) fewer nodes to cache.
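
A rough sketch of that alternative, for comparison. It is not self-contained: it reuses the ITree, IValue, TreeAdapter, spine and debug names from the snippet above, the visitedAmbs field is hypothetical, and what happens to each alternative inside the loop is an assumption, not the PR's actual code.

private final java.util.Set<ITree> visitedAmbs = new java.util.HashSet<>();

private void pushAmb(ITree amb) {
    // If this exact cluster was pushed before, all of its alternatives have
    // already been scheduled, so the whole cluster can be skipped at once.
    if (!visitedAmbs.add(amb)) {
        return;
    }
    if (debug) System.err.println("pushAmb: " + amb);
    spine.push(amb);
    for (IValue alt : TreeAdapter.getAlternatives(amb)) {
        spine.push(alt); // assumption: alternatives go onto the spine as before
    }
}

Whether this coarser cluster-level cache or the per-alternative cache memoizes more in practice is exactly the trade-off discussed further down in the thread.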

@@ -123,7 +147,12 @@ private void pushConcreteSyntaxNode(ITree tree){
    if (TreeAdapter.isAmb(tree)) {
        // only recurse
        for (IValue alt : TreeAdapter.getAlternatives(tree)) {
            pushConcreteSyntaxNode((ITree) alt);
            if (!visitedAmbChildren.contains(alt)) {
Member

Same here: we could check "visitedAmbs.contains(tree)" for the entire cluster and not recurse into the children at all if we have already done this before.

@@ -35,6 +36,7 @@
public class DescendantReader implements Iterator<IValue> {

    Stack<Object> spine = new Stack<Object>();
    private Set.Transient<IValue> visitedAmbChildren = Set.Transient.of();
Member

nice!

Member

Why would a normal HashSet not be enough? We're not sharing these sets, and we're not doing any immutable updates on them.

Member

The hash set has a huge footprint of empty array cells compared to this trie set. We have a very, very sparse set here, so the trie set is the optimal data structure.

Member

Also, this is a mutable trie set.

Member

I get that this is the transient set; I was just wondering why it's used (I see the value when it's eventually used to build an immutable version).

If only for the empty case, we could special-case it to start as null and initialize it the first time we encounter an amb node?

Anyway, I was just wondering why this specific set was used in this case, while the interpreter is mostly full of standard JDK containers (or vallang, of course).
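
For what it's worth, a minimal sketch of that lazy-initialization idea. The alreadyVisited helper is hypothetical and not part of this PR; it assumes capsule's Set.Transient API (Set.Transient.of and contains appear in the diff above, __insert is assumed).

private Set.Transient<IValue> visitedAmbChildren = null; // allocated lazily

private boolean alreadyVisited(IValue alt) {
    if (visitedAmbChildren == null) {
        // First amb cluster encountered: only now pay for the set.
        visitedAmbChildren = Set.Transient.of();
    }
    if (visitedAmbChildren.contains(alt)) {
        return true;
    }
    visitedAmbChildren.__insert(alt);
    return false;
}

(As the later reply notes, with the transient trie set the empty case is already cheap, so this extra branch may not be worth it.)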

Member

It's simply a conscious choice of the most optimal hash set implementation; we need a hash set, but we don't need to pay for all the zeroes. The empty hash trie set is already one to two orders of magnitude (10 to 100 times) smaller than its array-based counterpart.

Member

So the trick with a null table is not even necessary if we use this transient set. The chances are high that we encounter only one amb node, which is exactly when the trie set outshines the table implementation.

jurgenvinju (Member) left a comment

I think this works, but it could be optimized by caching the entire cluster instead of each individual alternative within the cluster. Wdyt @PieterOlivier ?

PieterOlivier (Contributor, Author)

I remember seeing cases where multiple ambiguity clusters had children that were mostly, but not all, identical. In those cases, caching the children allows more memoization. I do not know how common this is.

Maybe we should focus on what is most natural for the user? The "cache the ambiguity cluster itself" solution probably wins here.

I will make some measurements with the current solution and then implement memoization at the amb node level and measure again.
