
datahub-upgrade RestoreIndices process performance is poor on large sets of data #11276

Open
nmbryant opened this issue Aug 30, 2024 · 3 comments
Labels
bug Bug report

Comments

nmbryant (Contributor) commented Aug 30, 2024

Describe the bug
The performance of the datahub-upgrade RestoreIndices process degrades significantly over the course of a run against a large set of data. In our lower environment we have 9 million aspects, and in our other environments we have 50+ million. With urnBasedPagination set to false, threads set to 10, and running in dev on all data, it was able to process 1 million aspects in 20 minutes, which was reasonable. However, as it runs, the estimated time to completion keeps growing, with each batch taking significantly longer than the last. After running for over 12 hours, it had only processed 4 million aspects before I stopped it.

Because of this, back in December we implemented urnBasedPagination, which gave linear performance regardless of the number of aspects. When we first implemented it, RestoreIndices was able to process all 9M aspects in the dev environment in less than an hour. However, after upgrading from v0.10.5 to v0.12.1, the process is much slower with urnBasedPagination, although execution time is still linear: it now takes 1 hour to process 600K aspects.
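
For context on why the two modes scale so differently: offset-based paging makes the database scan past every previously returned row on each page, so page cost grows with position, while urn-based (keyset) pagination resumes with an index seek from the last (urn, aspect) pair seen. Below is a minimal sketch of a keyset page fetch, assuming Ebean's query API and the metadata_aspect_v2 column names (urn, aspect, version); this is an illustration, not the actual implementation:

    import io.ebean.DB;
    import java.util.List;

    // Keyset ("urn-based") pagination sketch: resume strictly after the last
    // (urn, aspect) pair instead of skipping an offset, so each page is an
    // index seek whose cost does not grow with position in the table.
    // EbeanAspectV2 and the column names are assumptions based on the DataHub schema.
    public class KeysetPageFetcher {
      static List<EbeanAspectV2> nextPage(String lastUrn, String lastAspect, int batchSize) {
        return DB.find(EbeanAspectV2.class)
            .where()
            .eq("version", 0L)
            .raw("(urn > ? OR (urn = ? AND aspect > ?))", lastUrn, lastUrn, lastAspect)
            .orderBy("urn asc, aspect asc")
            .setMaxRows(batchSize)
            .findList();
      }
    }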

To Reproduce
Steps to reproduce the behavior:

  1. Run the RestoreIndices process of datahub-upgrade on all aspects; at least a few million aspects are needed to reproduce
  2. With urnBasedPagination set to false and threads at 10, you should see the performance of each batch degrade over time
  3. With urnBasedPagination set to true, you should see linear but slow performance, at 600K aspects per hour

Expected behavior
Expect the process, with threading enabled and urnBasedPagination set to false, to be able to run against a large set of data in a reasonable amount of time. As it stands, it is not usable for large amounts of data.
With urnBasedPagination set to true, we would like the performance restored to what it was when we first implemented it.


Desktop (please complete the following information):

  • OS: iOS
  • Browser: Chrome
  • Version v0.13.3

Additional context
My guess is that the cause is the change from PagedList to a stream in EbeanAspectDao.
This is the version of EbeanAspectDao that is performant for us: https://github.com/nmbryant/datahub/blob/3ed6102d875b40452354d4044d7b89724c702ab6/metadata-io/src/main/java/com/linkedin/metadata/entity/ebean/EbeanAspectDao.java#L464
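
For illustration, here is a hedged sketch of the two access patterns using Ebean's standard query API (findPagedList vs findStream); the EbeanAspectV2 entity and the filter shown are assumptions, not the exact DataHub code:

    import io.ebean.DB;
    import io.ebean.PagedList;
    import java.util.stream.Stream;

    public class PagingComparison {

      // Offset-based page fetch (old behavior): each page is one bounded query,
      // and the JDBC result set is closed as soon as the page is materialized.
      static PagedList<EbeanAspectV2> fetchPage(int start, int batchSize) {
        return DB.find(EbeanAspectV2.class)
            .where().eq("version", 0L)
            .orderBy("urn asc, aspect asc")
            .setFirstRow(start)
            .setMaxRows(batchSize)
            .findPagedList();
      }

      // Stream-based fetch (new behavior): a single query whose cursor stays open
      // while rows are consumed downstream; slow consumers keep the transaction
      // and result set alive for the entire run.
      static void streamAll() {
        try (Stream<EbeanAspectV2> rows =
            DB.find(EbeanAspectV2.class)
                .where().eq("version", 0L)
                .orderBy("urn asc, aspect asc")
                .findStream()) {
          rows.forEach(row -> {
            // process the row (rebuild index documents, emit MCPs, ...)
          });
        }
      }
    }

If the stream holds one cursor open while each batch also does slow downstream work, the database side can stall in a way offset paging would not, which would be consistent with the growing timeSqlQueryMs reported in the comments below.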

From my testing, it appears that it became slower after this change: 9a0a53b

@david-leifker What was the reason for switching it to a stream and do you think I would be able to easily switch back to a PagedList to see if that would improve the performance?

@nmbryant nmbryant added the bug Bug report label Aug 30, 2024
nmbryant (Contributor, Author) commented Sep 4, 2024

After some further analysis I was able to make some progress. I was able to restore the performance by adding the original restoreIndices function to EntityServiceImpl and adding getPagedAspects to EbeanAspectDao.

Comparing the results between these changes and what's currently in main, there appear to be 2 parts that are significantly slower with the newer code.

  1. This step causes a significant slowdown; performance for the full restore improved a lot when it was commented out. When doing a full restore, running this step doesn't seem viable as-is. Should there be an env var to disable this step, or should its performance be improved?

  2. The SQL queries take significantly longer than with the old code. Here is a comparison of the first 5000-row batch between the new code and the old code:

    # Old Code
    2024-09-03 19:54:05,281 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - metrics so far RestoreIndicesResult(ignored=0, rowsMigrated=5000, timeSqlQueryMs=3, timeGetRowMs=0, timeUrnMs=18, timeEntityRegistryCheckMs=13, aspectCheckMs=3, createRecordMs=512, sendMessageMs=2002, defaultAspectsCreated=0, lastUrn=, lastAspect=)
    
    # New Code
    2024-09-04 13:28:38,668 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - metrics so far RestoreIndicesResult(ignored=0, rowsMigrated=5000, timeSqlQueryMs=438, timeGetRowMs=0, timeUrnMs=11, timeEntityRegistryCheckMs=9, aspectCheckMs=12, createRecordMs=3, sendMessageMs=29947, defaultAspectsCreated=0, lastUrn=, lastAspect=)
    

    The timeSqlQueryMs value increases with each batch, which suggests the queries also get slower over time, since it doesn't appear to be a cumulative metric and should only cover the particular batch.

    # Later batch (note this is with the step 6 mentioned earlier commented out)
    2024-09-04 14:30:19,396 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - metrics so far RestoreIndicesResult(ignored=0, rowsMigrated=785000, timeSqlQueryMs=389751, timeGetRowMs=0, timeUrnMs=89, timeEntityRegistryCheckMs=176, aspectCheckMs=266, createRecordMs=33, sendMessageMs=39263, defaultAspectsCreated=0, lastUrn=, lastAspect=)
    

    As of now I'm not sure of the cause of the SQL query slowdown, but it is something that needs to be improved.
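
To make the cumulative-versus-per-batch distinction concrete, here is a tiny hypothetical sketch of the two accounting styles; the real RestoreIndicesResult bookkeeping may differ:

    // Hypothetical accounting sketch; names are illustrative only, not the
    // actual RestoreIndicesResult fields.
    public class SqlTimeAccounting {
      static long reportSqlTime(long previousTotalMs, long thisBatchMs, boolean cumulative) {
        // Per-batch: the reported value stays flat when query cost is stable.
        // Cumulative: the value grows every batch by design, even if each
        // individual query is as fast as the first.
        return cumulative ? previousTotalMs + thisBatchMs : thisBatchMs;
      }
    }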

Note that these comments only apply to urnBasedPagination; I haven't looked into the performance degrading over time when it is set to false with threading.

nmbryant (Contributor, Author) commented Sep 4, 2024

Actually, it looks like the overall performance was pretty close to the old code once the MCP step was commented out. So these are the changes that I would propose:

  1. Short term: add a flag to disable MCP events in the RestoreIndices process (a hypothetical sketch follows this list)
  2. Long term: improve the performance of sending the MCP events
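
A minimal sketch of what such a gate could look like; the RESTORE_INDICES_EMIT_MCP environment variable and the produceMcp hook are hypothetical names, not existing DataHub configuration:

    import java.util.List;
    import java.util.function.Consumer;

    // Hypothetical gate; the env var and the produceMcp hook are illustrative only.
    public class McpGate {
      private static final boolean EMIT_MCP =
          Boolean.parseBoolean(System.getenv().getOrDefault("RESTORE_INDICES_EMIT_MCP", "true"));

      static void restoreBatch(List<EbeanAspectV2> batch, Consumer<EbeanAspectV2> produceMcp) {
        for (EbeanAspectV2 row : batch) {
          // ...rebuild the search/graph documents for this aspect here...
          if (EMIT_MCP) {
            produceMcp.accept(row); // the slow step behind the large sendMessageMs above
          }
        }
      }
    }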

Another issue I'm getting with the new code is this timeout error:

java.util.concurrent.TimeoutException: Exceeded request timeout of 10000ms

Anyone know which variable I can change to increase this timeout? This doesn't happen with the old code, and I am able to restore the entire dev environment with 9M aspects.

vejeta (Contributor) commented Oct 21, 2024

AFAIK, the timeout had a default value that was removed a month ago here:

https://github.com/datahub-project/datahub/pull/11456/files#diff-4c4357beca85c5d52050b7923d49c04753f291f51867a06b4e6ea69ff1f0a0c9R219

That change followed a similar reported issue: #10557

If you are on an older version of DataHub and don't want to upgrade yet, you can change it here:
https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ESIndexBuilder.java#L150
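
For illustration, request timeouts for the Elasticsearch REST client are typically raised through the client builder; whether this is the exact knob behind the 10000ms default in ESIndexBuilder is an assumption:

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestClientBuilder;

    // Sketch of raising the REST client timeouts (values are illustrative).
    public class EsClientTimeouts {
      static RestClientBuilder builder() {
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
            .setRequestConfigCallback(requestConfig -> requestConfig
                .setConnectTimeout(5_000)      // ms to establish a connection
                .setSocketTimeout(120_000));   // ms to await a response; raise past the 10s in the error
      }
    }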
