Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SchemaFieldEntities seem to cause performance issues #11417

Closed
guentherhackl-wgs opened this issue Sep 18, 2024 · 2 comments
Closed

SchemaFieldEntities seem to cause performance issues #11417

guentherhackl-wgs opened this issue Sep 18, 2024 · 2 comments
Labels
bug Bug report

Comments

@guentherhackl-wgs
Copy link

Describe the bug
The symptom of the issue was that the schema of a kafka topic would not load and after ~30s it dispayed "No Data" for the schema although the schemaMetadataAspect was visible in the database.

In the GMS Logs I could see that the graphQL query to retrieve the schemaMetadata caused an async timeout (and still finished after 200s)

2024-09-18 12:40:20,823 [qtp1213216872-8203] INFO  c.datahub.graphql.GraphQLController:125 - Processing request, operation: getSchemaBlame, actor urn: urn:li:corpuser:guenther.hackl
2024-09-18 12:40:20,824 [ForkJoinPool.commonPool-worker-4154] INFO  c.datahub.graphql.GraphQLController:130 - Executing operation getSchemaBlame for qtp1213216872-8203
2024-09-18 12:40:20,847 [ForkJoinPool.commonPool-worker-4159] INFO  c.datahub.graphql.GraphQLController:153 - Executed operation getVersionedDataset in 26 ms
2024-09-18 12:40:20,849 [ForkJoinPool.commonPool-worker-4159] INFO  c.datahub.graphql.GraphQLController:158 - Operation getVersionedDataset execution result size: 90772
2024-09-18 12:40:20,874 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:61 - Successfully fed bulk request 3341. Number of events: 6 Took time ms: 12
2024-09-18 12:40:21,033 [ForkJoinPool.commonPool-worker-4154] INFO  c.datahub.graphql.GraphQLController:153 - Executed operation getSchemaBlame in 208 ms
2024-09-18 12:40:21,034 [ForkJoinPool.commonPool-worker-4154] INFO  c.datahub.graphql.GraphQLController:158 - Operation getSchemaBlame execution result size: 101855
2024-09-18 12:40:50,507 [qtp1213216872-8294] WARN  o.s.w.s.m.s.DefaultHandlerExceptionResolver:247 - Resolved [org.springframework.web.context.request.async.AsyncRequestTimeoutException]
2024-09-18 12:42:40,231 [ForkJoinPool.commonPool-worker-22] INFO  c.datahub.graphql.GraphQLController:153 - Executed operation getDatasetSchema in 100594 ms
2024-09-18 12:42:40,280 [ForkJoinPool.commonPool-worker-22] INFO  c.datahub.graphql.GraphQLController:158 - Operation getDatasetSchema execution result size: 226375

No resource restrictions showed up on any graph in our graphana. There were seemingly enough cpu and ram for GMS and OpenSearch. THe times for the other request don't look that bad to me.

I've gone through the graphql query and deleted parts to narrow down the issue and was able to find the following minimal query to produce it in my setup, which indicated the involvement of the SchemaFieldEntities:

query getDatasetSchema($urn: String!) {
  dataset(urn: $urn) {
    ...datasetSchema
  }
}

fragment datasetSchema on Dataset {
  schemaMetadata(version: 0) {
    ...schemaMetadataFields
  }
}

fragment schemaMetadataFields on SchemaMetadata {
  fields {
    ...schemaFieldFields
  }
}  

fragment schemaFieldFields on SchemaField {

  schemaFieldEntity {
    urn
    type
  }
}

After setting the feature flag for SchemaFieldEntities to false (SCHEMA_FIELD_ENTITY_FETCH_ENABLED=false) it significantly improved.

There seem to be two factors which have an effect on this:

  • size of the schema: the schema in question was an avro schema with a bout 120 fields. smaller schemas could be loaded. there were also some schemas which were on the verge of the timeout, sometimes they could be loaded sometimes not
  • number of entities in datahub: just setting it up locally with only this one schema the loading was no problem. on our dev deployment where not a lot of datasets are loaded it still worked. On test it did not work, and prod with the largest number of datasets (61k) also did not work and the log took longer to appear after the AsyncTimeout.

To Reproduce
Just loading the affected large schema to a local datahub did not produce the issue. The number of entities seem to be important to. I have not yet tried it, but I would assume to reproduce a significant amount of datasets is needed locally

Expected behavior
Show the schema when it is verifiably in the database

Desktop (please complete the following information):

  • OS: MacOS 15
  • Browser: Chrome
  • Datahub-Version: v0.13.3
@guentherhackl-wgs guentherhackl-wgs added the bug Bug report label Sep 18, 2024
@david-leifker
Copy link
Collaborator

We'll get a PR up to improve the performance of the graphql call. There has been at least one possible mitigation to improve performance in this case to reduce redundant access checks in #11327 part of v0.14.1.

@david-leifker
Copy link
Collaborator

#11434 should hopefully fix this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report
Projects
None yet
Development

No branches or pull requests

2 participants