SPARQL engine: streaming evaluation, bind joins, and expression fixes #947

Merged: cybermaggedon merged 1 commit into release/v2.5 from feature/sparql-stream-optimisation on May 21, 2026

@cybermaggedon (Contributor)

Convert the SPARQL algebra evaluator from eager list-based evaluation to lazy async generators so results stream incrementally. This lets Slice terminate early (via generator cleanup) and avoids materialising full result sets for streamable operators like Project, Filter, Union, and Extend. Blocking operators (Join, LeftJoin, OrderBy, Group) materialise at their boundary then yield.
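
To illustrate why lazy generators let Slice stop early, here is a minimal sketch, not the evaluator from this PR. The operator names `scan`, `filter_`, and `slice_` are hypothetical stand-ins, and solutions are plain dicts. Each operator closes its upstream generator in a `finally` block, which is one way to get the cleanup behaviour described above.

```python
import asyncio


async def scan(rows, log):
    """Hypothetical leaf operator: yields one solution at a time."""
    for row in rows:
        log.append(row)  # record what was actually pulled from the source
        yield row


async def filter_(source, predicate):
    """Streamable operator: passes matching solutions through lazily."""
    try:
        async for sol in source:
            if predicate(sol):
                yield sol
    finally:
        await source.aclose()  # propagate cleanup to the upstream generator


async def slice_(source, start, length):
    """Skip `start` solutions, yield up to `length`, then stop pulling."""
    index = 0
    try:
        if length <= 0:
            return
        async for sol in source:
            index += 1
            if index <= start:
                continue
            yield sol
            if index >= start + length:
                break  # early termination: no further upstream pulls
    finally:
        await source.aclose()
```

With 100 input rows, an even-number filter, and `slice_(..., start=1, length=2)`, only the first five rows are ever pulled from `scan`; the remaining 95 are never produced.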

Add a bind join optimisation for Join nodes where one side is small (VALUES/ToMultiSet): instead of materialising both sides independently and hash-joining, iterate the small side's bindings and evaluate the large side with those bindings pre-seeded. This turns wildcard BGP queries into selective ones: for example, VALUES ?x { <uri> } joined with a BGP now queries the triple store with ?x bound rather than fetching all triples.
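
The bind join idea can be sketched as follows. This is an illustration under assumptions, not the code in this PR: terms are plain strings, variables start with `?`, the store is an in-memory list of triples, and `match_pattern` and `bind_join` are hypothetical names. The `calls` list records the pattern that would be sent to the triple store, so the effect of pre-seeding is visible.

```python
import asyncio


async def match_pattern(store, pattern, binding, calls):
    """Hypothetical single-pattern BGP evaluator with pre-seeded bindings."""
    # Substitute variables that are already bound before querying the store.
    concrete = tuple(binding.get(term, term) for term in pattern)
    calls.append(concrete)
    for triple in store:
        sol = dict(binding)
        for term, value in zip(concrete, triple):
            if term.startswith("?"):
                # Unbound variable: bind it, or reject on a conflicting repeat.
                if sol.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield sol


async def bind_join(small_side, store, pattern, calls):
    """Evaluate the large side once per binding from the small side."""
    for binding in small_side:
        async for sol in match_pattern(store, pattern, binding, calls):
            yield sol
```

For a small side of `[{"?x": "ex:a"}]` and the pattern `("?x", "?p", "?o")`, the store is queried with `("ex:a", "?p", "?o")` rather than with three unbound variables. The trade-off is one store query per small-side binding, which is why this only pays off when that side is small.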

Add TriplesClient.query_gen() async generator that wraps the existing streaming callback API via an asyncio.Queue bridge, yielding individual Triple objects as batches arrive.
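
A minimal sketch of the callback-to-generator bridge, assuming the streaming API awaits an async callback once per batch. The real TriplesClient signature is not shown in this description, so `fake_streaming_query` and the standalone `query_gen` function below are hypothetical stand-ins.

```python
import asyncio

_DONE = object()  # sentinel marking the end of the stream


async def fake_streaming_query(on_batch):
    """Stand-in for a callback-based streaming API (hypothetical)."""
    for batch in ([("a", "p", "b"), ("a", "p", "c")], [("b", "p", "c")]):
        await on_batch(batch)


async def query_gen(streaming_query, maxsize=2):
    """Bridge a batch callback into an async generator of single triples."""
    queue = asyncio.Queue(maxsize=maxsize)  # bounded, so the producer waits

    async def on_batch(batch):
        await queue.put(batch)

    async def producer():
        try:
            await streaming_query(on_batch)
            await queue.put(_DONE)
        except Exception as exc:  # forward failures to the consumer
            await queue.put(exc)

    task = asyncio.create_task(producer())
    try:
        while True:
            item = await queue.get()
            if item is _DONE:
                break
            if isinstance(item, Exception):
                raise item
            for triple in item:
                yield triple
    finally:
        # Stop the producer if the consumer exits early.
        task.cancel()
        await asyncio.gather(task, return_exceptions=True)
```

The bounded queue gives simple backpressure: if the consumer is slow, the producer blocks in `queue.put` instead of buffering the whole result set.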

Add streaming request path in the SPARQL query service that batches solutions from the live async generator and sends them as they fill.
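
The batching step might look roughly like this. It is a sketch only: `stream_solutions`, the `send` callback, and the `final` flag on the last batch are assumptions for illustration and are not taken from the service code.

```python
import asyncio


async def stream_solutions(solutions, send, batch_size=2):
    """Accumulate solutions from an async generator and send full batches."""
    batch = []
    async for sol in solutions:
        batch.append(sol)
        if len(batch) >= batch_size:
            await send(batch, False)  # a full batch goes out immediately
            batch = []
    # Flush whatever is left (possibly empty) and mark the stream complete.
    await send(batch, True)
```

Because batches are sent as they fill, the first response can leave before the query has finished evaluating.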

Fix FILTER IN/NOT IN: rdflib represents these as RelationalExpression nodes with op="IN", not as Builtin_IN — handle both representations.

Fix Builtin_IN/Builtin_NOTIN dispatch ordering so the specific handlers are checked before the generic Builtin_ prefix match.
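
The ordering issue can be shown with a small dispatch sketch. The handler functions here are hypothetical placeholders; only the node names `Builtin_IN` and `Builtin_NOTIN` and the generic `Builtin_` prefix come from the description above.

```python
def handle_in(value, candidates):
    return value in candidates


def handle_not_in(value, candidates):
    return value not in candidates


def handle_generic_builtin(*args):
    raise NotImplementedError("generic builtin path (placeholder)")


# Exact names are looked up first so they cannot be shadowed by the prefix.
EXACT_HANDLERS = {
    "Builtin_IN": handle_in,
    "Builtin_NOTIN": handle_not_in,
}


def dispatch(node_name):
    """Return the handler for an expression node name."""
    handler = EXACT_HANDLERS.get(node_name)
    if handler is not None:
        return handler
    if node_name.startswith("Builtin_"):
        return handle_generic_builtin
    raise NotImplementedError(node_name)
```

If the prefix check ran first, `Builtin_IN` would match `Builtin_` and never reach its specific handler, which is the bug class this fix addresses.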

Fix VALUES handling for rdflib's two representations: positional (var/value) and dict-based (res).
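
One way to normalise the two shapes into a single list of bindings is sketched below. It assumes the positional form exposes `var` and `value` attributes and the dict-based form exposes `res`, as named above; the exact rdflib node layout is not shown here, so the helper name `values_to_bindings` and the handling of flat single-variable rows are assumptions.

```python
from types import SimpleNamespace


def values_to_bindings(node):
    """Return a list of {variable: value} dicts for either VALUES shape."""
    res = getattr(node, "res", None)
    if res is not None:
        # Dict-based form: already one mapping per row.
        return [dict(row) for row in res]

    # Positional form: variables and rows are aligned by position.
    variables = list(getattr(node, "var", None) or [])
    rows = getattr(node, "value", None) or []
    bindings = []
    for row in rows:
        if not isinstance(row, (list, tuple)):
            row = [row]  # assumed: single-variable rows may be flat values
        bindings.append(dict(zip(variables, row)))
    return bindings


# Stand-in nodes for illustration only.
positional = SimpleNamespace(var=["x", "y"], value=[["a", "b"], ["c", "d"]])
dict_based = SimpleNamespace(res=[{"x": "a", "y": "b"}])
```

Both stand-in nodes produce the same kind of output, so downstream code such as a bind join only has to deal with one representation.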

@cybermaggedon merged commit 6af12f4 into release/v2.5 on May 21, 2026
2 checks passed
@cybermaggedon deleted the feature/sparql-stream-optimisation branch on May 21, 2026 14:49

@github-actions

Contributor License Agreement ✅

All contributors have signed the CLA. Thank you!
