partitions: 8 leads to inefficient COUNT wrapper #641

Open · pascalwhoop opened this issue Jul 26, 2024 · 4 comments

pascalwhoop commented Jul 26, 2024

Expected Behavior (Mandatory)

partitions: 8 should not lead to performance degradation upon read
https://neo4j.com/docs/spark/current/read/options/

Actual Behavior (Mandatory)

The query

    MATCH (n: Entity) RETURN n.id as id, n.embedding as embedding

is converted into

    CALL { MATCH (n: Entity) RETURN n.id as id, n.embedding as embedding }
    RETURN count(*) AS count

This means the connector retrieves all matching data from the database, counts the rows, computes the partition sizes, and then reads the data again, instead of issuing a direct count(*) (which is much faster).

Is this because we're using the Spark 'query'-based reading?
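For context, here is a minimal sketch of the kind of query-based read that produces this wrapper. It uses the connector's documented url, query, and partitions read options; the Spark session setup, connection details, and credentials are placeholders, not taken from the original report:

    from pyspark.sql import SparkSession

    # Placeholder session; the Neo4j Spark connector jar is assumed to be on
    # the classpath (e.g. added via --packages).
    spark = SparkSession.builder.appName("neo4j-partitioned-read").getOrCreate()

    df = (
        spark.read
        .format("org.neo4j.spark.DataSource")
        .option("url", "neo4j://localhost:7687")            # placeholder connection
        .option("authentication.basic.username", "neo4j")   # placeholder credentials
        .option("authentication.basic.password", "secret")
        .option("query", "MATCH (n: Entity) RETURN n.id as id, n.embedding as embedding")
        .option("partitions", 8)  # requesting partitions triggers the count wrapper
        .load()
    )

With partitions set, the connector needs the total row count up front to split the read, which is where the wrapped count(*) query above comes from.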

ahxxm commented Jul 27, 2024

A workaround I'm using is to pass a count query via read_options["query.count"] = "MATCH (n: Entity) RETURN COUNT(*) as count"; the connector then uses the underlying metadata (count) store to get the node count in O(1). This is especially needed when scanning edges, because that store only supports specifying an edge type plus one node type for O(1) access.

Additionally, the connector converts each partition into a SKIP + LIMIT query (so each executor still has to SKIP past the preceding rows before reading its LIMIT), and the resulting DataFrame can contain duplicates when the database is concurrently being written to.
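To make the workaround concrete, here is a minimal sketch assuming the connector's documented query.count read option, with the options collected in a read_options dict as in the comment above; the connection details are placeholders and the spark session is assumed to exist already:

    read_options = {
        "url": "neo4j://localhost:7687",  # placeholder connection
        "query": "MATCH (n: Entity) RETURN n.id as id, n.embedding as embedding",
        # Explicit count query: lets the connector size its partitions via a
        # cheap count instead of wrapping and fully executing the read query.
        "query.count": "MATCH (n: Entity) RETURN COUNT(*) as count",
        "partitions": "8",
    }

    df = (
        spark.read
        .format("org.neo4j.spark.DataSource")
        .options(**read_options)
        .load()
    )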

pascalwhoop (Author) commented Jul 27, 2024 via email

ahxxm commented Jul 28, 2024

Yeah, the connector just wraps simple operations without any heuristics. By "labels parameter", do you mean some indexed fields that enable a cursor-based query?
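For comparison, here is a sketch of what a labels-based read might look like, assuming the connector's documented labels option; with label-based reads the connector can typically obtain the node count from Neo4j's count store rather than wrapping a user query. The connection details are placeholders, and the select assumes the id and embedding properties are part of the inferred schema:

    df = (
        spark.read
        .format("org.neo4j.spark.DataSource")
        .option("url", "neo4j://localhost:7687")  # placeholder connection
        .option("labels", "Entity")               # read nodes by label instead of a Cypher query
        .option("partitions", 8)
        .load()
        .select("id", "embedding")
    )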

fbiville (Contributor) commented

> I guess at the moment the queries are treated as a string, not decomposed into a query that can be optimized by a query planner as e.g. Spark does it

That's correct, the Spark connector currently treats Cypher queries as black boxes.
