datahub-upgrade RestoreIndices process performance is poor on large sets of data #11276
Comments
After some further analysis I was able to make some progress. I restored the performance by adding the original restoreIndices function back to EntityServiceImpl and adding getPagedAspects back to EbeanAspectDao. Comparing these changes against what's currently in main, there appear to be two parts that are significantly slower with the newer code.
Note that these comments only apply to using urnBasedPagination; I haven't looked into the performance degradation over time when it is set to false with threads.
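For reference, the shape of a PagedList-based getPagedAspects is roughly the following. This is a minimal sketch against Ebean's PagedList API, not the actual DataHub method; the real implementation applies additional filters (aspect name, urn pattern, last-urn cursor) that are omitted here, and the column names are assumptions.

```java
import com.linkedin.metadata.entity.ebean.EbeanAspectV2;
import io.ebean.Database;
import io.ebean.PagedList;

// Sketch of an offset-paged aspect fetch using Ebean's PagedList.
// The real EbeanAspectDao.getPagedAspects carries more filters; these are omitted.
public class PagedAspectFetchSketch {

  public static PagedList<EbeanAspectV2> getPagedAspects(
      Database server, int start, int pageSize) {
    return server
        .find(EbeanAspectV2.class)
        .where()
        .eq("version", 0L) // latest aspect versions only (column name assumed)
        .orderBy("urn, aspect")
        .setFirstRow(start)
        .setMaxRows(pageSize)
        .findPagedList();
  }
}
```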
Actually, it looks like the overall performance was pretty close to the old code once the MCP step was commented out. So these are the changes that I would propose:
Another issue I'm getting with the new code is this timeout error:
Anyone know which variable I can change to increase this timeout? This doesn't happen with the old code, and I am able to restore the entire dev environment with 9M aspects.
AFAIK, the timeout had a default value that was removed a month ago here, after a similar reported issue: #10557. If you are on an older version of DataHub and don't want to upgrade yet, you can change it here:
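For anyone experimenting locally, the Ebean-level knob that such a default maps to is the per-query timeout. A hedged sketch follows; where DataHub actually sets or removes this default is version-dependent, as the linked change shows.

```java
import com.linkedin.metadata.entity.ebean.EbeanAspectV2;
import io.ebean.Database;
import io.ebean.Query;

// Sketch: raising the per-query timeout on an Ebean query. A value of 0 maps to
// JDBC's "no limit" behaviour. This only illustrates the Ebean API; the place where
// DataHub sets its own default differs between versions.
public class QueryTimeoutSketch {

  public static Query<EbeanAspectV2> withTimeout(Database server, int timeoutSecs) {
    return server
        .find(EbeanAspectV2.class)
        .setTimeout(timeoutSecs);
  }
}
```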
Describe the bug
The datahub-upgrade RestoreIndices process degrades significantly as it runs against a large set of data. In our lower environment we have 9 million aspects, and in our other environments we have 50+ million. With urnBasedPagination set to false, threads set to 10, and running in dev on all data, it was able to process 1 million aspects in 20 minutes, which was reasonable. However, as it runs, the estimated time to complete keeps getting longer, with each batch taking significantly more time. After running for over 12 hours, it had only processed 4 million aspects before I stopped it.
Because of this, back in December we implemented urnBasedPagination, which gave linear performance regardless of the number of aspects. When we implemented it, RestoreIndices was able to process all 9M aspects in the dev environment in less than an hour. However, after upgrading from v0.10.5 to v0.12.1, the process is much slower with urnBasedPagination, although execution time is still linear: it takes 1 hour to process 600K aspects.
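For context, one plausible reason for the difference in scaling (not confirmed in this thread) is that offset paging re-scans an ever-growing prefix of rows on each batch, while urn-based (keyset) paging seeks straight past the last urn seen, so each batch costs roughly the same. A sketch of the two query shapes, with illustrative column names:

```java
import com.linkedin.metadata.entity.ebean.EbeanAspectV2;
import io.ebean.Database;
import java.util.List;

// Illustrative comparison of the two pagination strategies RestoreIndices can use.
public class PaginationShapes {

  // Offset paging: the database skips `offset` rows on every batch, so batches get
  // progressively slower as the offset grows.
  public static List<EbeanAspectV2> offsetPage(Database server, int offset, int pageSize) {
    return server
        .find(EbeanAspectV2.class)
        .orderBy("urn")
        .setFirstRow(offset)
        .setMaxRows(pageSize)
        .findList();
  }

  // Keyset ("urn-based") paging: each batch seeks directly past the last urn seen,
  // so the cost per batch stays roughly constant.
  public static List<EbeanAspectV2> urnPage(Database server, String lastUrn, int pageSize) {
    return server
        .find(EbeanAspectV2.class)
        .where()
        .gt("urn", lastUrn)
        .orderBy("urn")
        .setMaxRows(pageSize)
        .findList();
  }
}
```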
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Expect the process, when using threads with urnBasedPagination set to false, to run against a large set of data in a reasonable amount of time. As it is now, it is not usable for large amounts of data.
With urnBasedPagination set to true, we would like the performance restored to what it was when we first implemented it.
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
Additional context
My guess is that it's the change from PagedList to stream in EbeanAspectDao.
This is the version of EbeanAspectDao that is performant for us: https://github.com/nmbryant/datahub/blob/3ed6102d875b40452354d4044d7b89724c702ab6/metadata-io/src/main/java/com/linkedin/metadata/entity/ebean/EbeanAspectDao.java#L464
From my testing, it appears that it became slower after this change: 9a0a53b
@david-leifker What was the reason for switching it to a stream and do you think I would be able to easily switch back to a PagedList to see if that would improve the performance?
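For comparison, the two access patterns in question look roughly like this; this is a hedged sketch, since the real methods in EbeanAspectDao carry more filters and mapping logic.

```java
import com.linkedin.metadata.entity.ebean.EbeanAspectV2;
import io.ebean.Database;
import io.ebean.PagedList;
import java.util.stream.Stream;

// Sketch of the two fetch styles being compared. The PagedList form issues a bounded
// query per page; the stream form keeps a cursor open and pulls rows lazily, which can
// interact differently with batching, threading, and transaction/timeout settings.
public class FetchStyleComparison {

  public static PagedList<EbeanAspectV2> pagedFetch(Database server, int start, int pageSize) {
    return server
        .find(EbeanAspectV2.class)
        .orderBy("urn")
        .setFirstRow(start)
        .setMaxRows(pageSize)
        .findPagedList();
  }

  // The returned stream should be closed (e.g. via try-with-resources) to release the cursor.
  public static Stream<EbeanAspectV2> streamFetch(Database server) {
    return server
        .find(EbeanAspectV2.class)
        .orderBy("urn")
        .findStream();
  }
}
```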