SchemaFieldEntities seem to cause performance issues #11417

guentherhackl-wgs · 2024-09-18T13:11:38Z

Describe the bug
The symptom of the issue was that the schema of a kafka topic would not load and after ~30s it dispayed "No Data" for the schema although the schemaMetadataAspect was visible in the database.

In the GMS Logs I could see that the graphQL query to retrieve the schemaMetadata caused an async timeout (and still finished after 200s)

2024-09-18 12:40:20,823 [qtp1213216872-8203] INFO  c.datahub.graphql.GraphQLController:125 - Processing request, operation: getSchemaBlame, actor urn: urn:li:corpuser:guenther.hackl
2024-09-18 12:40:20,824 [ForkJoinPool.commonPool-worker-4154] INFO  c.datahub.graphql.GraphQLController:130 - Executing operation getSchemaBlame for qtp1213216872-8203
2024-09-18 12:40:20,847 [ForkJoinPool.commonPool-worker-4159] INFO  c.datahub.graphql.GraphQLController:153 - Executed operation getVersionedDataset in 26 ms
2024-09-18 12:40:20,849 [ForkJoinPool.commonPool-worker-4159] INFO  c.datahub.graphql.GraphQLController:158 - Operation getVersionedDataset execution result size: 90772
2024-09-18 12:40:20,874 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:61 - Successfully fed bulk request 3341. Number of events: 6 Took time ms: 12
2024-09-18 12:40:21,033 [ForkJoinPool.commonPool-worker-4154] INFO  c.datahub.graphql.GraphQLController:153 - Executed operation getSchemaBlame in 208 ms
2024-09-18 12:40:21,034 [ForkJoinPool.commonPool-worker-4154] INFO  c.datahub.graphql.GraphQLController:158 - Operation getSchemaBlame execution result size: 101855
2024-09-18 12:40:50,507 [qtp1213216872-8294] WARN  o.s.w.s.m.s.DefaultHandlerExceptionResolver:247 - Resolved [org.springframework.web.context.request.async.AsyncRequestTimeoutException]
2024-09-18 12:42:40,231 [ForkJoinPool.commonPool-worker-22] INFO  c.datahub.graphql.GraphQLController:153 - Executed operation getDatasetSchema in 100594 ms
2024-09-18 12:42:40,280 [ForkJoinPool.commonPool-worker-22] INFO  c.datahub.graphql.GraphQLController:158 - Operation getDatasetSchema execution result size: 226375

No resource restrictions showed up on any graph in our graphana. There were seemingly enough cpu and ram for GMS and OpenSearch. THe times for the other request don't look that bad to me.

I've gone through the graphql query and deleted parts to narrow down the issue and was able to find the following minimal query to produce it in my setup, which indicated the involvement of the SchemaFieldEntities:

query getDatasetSchema($urn: String!) {
  dataset(urn: $urn) {
    ...datasetSchema
  }
}

fragment datasetSchema on Dataset {
  schemaMetadata(version: 0) {
    ...schemaMetadataFields
  }
}

fragment schemaMetadataFields on SchemaMetadata {
  fields {
    ...schemaFieldFields
  }
}  

fragment schemaFieldFields on SchemaField {

  schemaFieldEntity {
    urn
    type
  }
}

After setting the feature flag for SchemaFieldEntities to false (SCHEMA_FIELD_ENTITY_FETCH_ENABLED=false) it significantly improved.

There seem to be two factors which have an effect on this:

size of the schema: the schema in question was an avro schema with a bout 120 fields. smaller schemas could be loaded. there were also some schemas which were on the verge of the timeout, sometimes they could be loaded sometimes not
number of entities in datahub: just setting it up locally with only this one schema the loading was no problem. on our dev deployment where not a lot of datasets are loaded it still worked. On test it did not work, and prod with the largest number of datasets (61k) also did not work and the log took longer to appear after the AsyncTimeout.

To Reproduce
Just loading the affected large schema to a local datahub did not produce the issue. The number of entities seem to be important to. I have not yet tried it, but I would assume to reproduce a significant amount of datasets is needed locally

Expected behavior
Show the schema when it is verifiably in the database

Desktop (please complete the following information):

OS: MacOS 15
Browser: Chrome
Datahub-Version: v0.13.3

The text was updated successfully, but these errors were encountered:

david-leifker · 2024-09-19T20:10:20Z

We'll get a PR up to improve the performance of the graphql call. There has been at least one possible mitigation to improve performance in this case to reduce redundant access checks in #11327 part of v0.14.1.

david-leifker · 2024-09-19T20:41:11Z

#11434 should hopefully fix this issue.

guentherhackl-wgs added the bug Bug report label Sep 18, 2024

david-leifker closed this as completed Sep 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SchemaFieldEntities seem to cause performance issues #11417

SchemaFieldEntities seem to cause performance issues #11417

guentherhackl-wgs commented Sep 18, 2024

david-leifker commented Sep 19, 2024

david-leifker commented Sep 19, 2024

SchemaFieldEntities seem to cause performance issues #11417

SchemaFieldEntities seem to cause performance issues #11417

Comments

guentherhackl-wgs commented Sep 18, 2024

david-leifker commented Sep 19, 2024

david-leifker commented Sep 19, 2024