Cache call to path_to_url #12322
Closed
Commits (6, all by notatallshaw):
967607d Cache call to path_to_url
d1db13f Add news entry
7beea44 Can only cache after making absolute
8466261 Fix news entry
560478f End of line on news
7349bf4 Merge branch 'main' into path_to_url_cache
News entry:
@@ -0,0 +1 @@
+Improve performance ~17% when installing many wheels offline
This is unbounded and would store information that is not used more than once.
Yes, but what other solution is there?
We can't know ahead of time how many file paths need caching.
If a maxsize is given, it is completely arbitrary. If you think it's required for memory safety, I would prefer a very large number that is not expected to be reached, like 10,000.
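For illustration, a bounded cache along these lines might look like the sketch below. The conversion body is an assumed `os.path` plus `urllib` implementation rather than pip's actual code, and the name `path_to_url_cached` and the `maxsize` of 10,000 are only illustrative. The path is made absolute before the cache lookup, in line with the "Can only cache after making absolute" commit.

```python
import functools
import os
import urllib.parse
import urllib.request


# Hypothetical stand-in for the function under discussion; not pip's actual code.
@functools.lru_cache(maxsize=10_000)  # arbitrary bound; any value here is a guess
def path_to_url_cached(abs_path: str) -> str:
    """Convert an already-absolute, normalized path to a file: URL."""
    return urllib.parse.urljoin("file:", urllib.request.pathname2url(abs_path))


def path_to_url(path: str) -> str:
    # Resolve against the current directory *before* the cache lookup, so a
    # relative path cannot return a stale URL after the working directory changes.
    return path_to_url_cached(os.path.normpath(os.path.abspath(path)))


if __name__ == "__main__":
    for _ in range(3):
        path_to_url("wheels/example-1.0-py3-none-any.whl")
    # hits/misses show how often the cache was actually useful
    print(path_to_url_cached.cache_info())
```

With a bound in place, `lru_cache` evicts the least recently used entry once `maxsize` is reached, so memory stays capped at the cost of some repeated conversions.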
As you said above
Agreed, it's complex, but that may be better than throwing memory at the problem - after all, pip does get used in memory-constrained environments. I don't know what `lru_cache` does when it's getting close to memory limits, but I doubt it tries to manage that situation particularly - so you'd probably at some point start to get paging and a significant reduction in performance.
No, but it's not a matter of needing to cache anything. It's simply a case of only getting some of the performance benefits, not all of them.
Ultimately, we need to balance different use cases here. I'd consider installing 1000 wheels in a single install to be a very extreme case, and honestly I don't think 2.5 minutes is a particularly bad time for that. So while I'm always glad if we can get performance improvements, I think we have to be careful to keep perspective here. The overwhelming majority of the uses of pip "in the wild" are likely to be of the form `pip install <one or two packages>`.
Alternatively, maybe we can just speed up `path_to_url`? Looking at the following:
suggests that using `Path.as_uri()` is a lot faster. Using `Path.absolute()` rather than `os.path` loses a lot of the gain, I'm not sure why - maybe because it calls the `Path` constructor twice.
The point is, there may well be other options than caching the results. Or there may be improvements that can be achieved as well as a (limited-size) cache. As with any performance exercise, it's all about trade-offs.
Edit: The same tests on Ubuntu (WSL) don't give the same improvements for the pathlib approach. Make of that what you will.
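A comparison of this kind could be set up roughly as follows. This is a sketch, not the timing code from the comment above: the three function names are hypothetical, and `url_via_os_path` is only an assumed shape for an `os.path`/`urllib`-based conversion. No timings are shown because, as noted, results differ between platforms.

```python
import os
import timeit
import urllib.parse
import urllib.request
from pathlib import Path


def url_via_os_path(path: str) -> str:
    # Assumed shape of an os.path/urllib-based conversion.
    path = os.path.normpath(os.path.abspath(path))
    return urllib.parse.urljoin("file:", urllib.request.pathname2url(path))


def url_via_as_uri(path: str) -> str:
    # os.path makes the path absolute; pathlib only formats the URL.
    return Path(os.path.abspath(path)).as_uri()


def url_via_path_absolute(path: str) -> str:
    # Pure-pathlib variant: Path.absolute() instead of os.path.abspath().
    return Path(path).absolute().as_uri()


if __name__ == "__main__":
    sample = "wheels/example-1.0-py3-none-any.whl"
    for func in (url_via_os_path, url_via_as_uri, url_via_path_absolute):
        seconds = timeit.timeit(lambda: func(sample), number=100_000)
        print(f"{func.__name__}: {seconds:.3f}s")
```

Note that the variants are not guaranteed to be interchangeable: `os.path.abspath()` normalizes components such as `..`, while `Path.absolute()` does not, so any swap would need checking against edge cases as well as timing.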
Home Assistant is a very popular application in the smart home world, so it's a relatively common use case.
And it's 2.5 minutes on my machine, it's 1-2 hours on other people's: #12314. This was going to be my first in a series of PRs.
And the cache here only grows if there are a lot of wheels, so there's only a few kilobytes of memory used in non-"extreme" examples.
If a user needs to install over 1000 wheels using pip, I have to assume they have 1 or 2 MB of spare memory.
I'll take another look at the rearchitecture approach, but it probably means a significant rework of the way pip and resolvelib interact with each other. And my worry is that even if I was able to make a PR, there's a good chance it would never be accepted, since it would be a non-pip maintainer adding a significant amount of knowledge required for maintenance. Or that it would be rejected by resolvelib for breaking other downstream consumers.
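As a rough way to sanity-check the memory estimate, one could add up the size of the strings such a cache would hold. The sketch below is only a lower bound under stated assumptions: the path layout is made up, and the overhead of `lru_cache`'s own bookkeeping is ignored.

```python
import sys


def estimate_cache_bytes(n_entries: int) -> int:
    """Rough lower bound on memory for n cached path -> URL string pairs."""
    total = 0
    for i in range(n_entries):
        # Hypothetical wheel locations; real paths may be longer or shorter.
        path = f"/home/user/wheels/package_{i:05d}-1.0.0-py3-none-any.whl"
        url = "file://" + path
        total += sys.getsizeof(path) + sys.getsizeof(url)
    return total


if __name__ == "__main__":
    for n in (10, 1_000, 10_000):
        kib = estimate_cache_bytes(n) / 1024
        print(f"{n:>6} entries: ~{kib:.0f} KiB of string data")
```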
Yes, I've been profiling this: #12314 (comment) (`path_to_url` is the leftmost light greeny-blue box).
There are lots of other hot spots in this profile graph, but I just thought I'd start with the simplest-looking offender.
The two big callers are `file_links()` and `page_candidates` in `_FlatDirectorySource`, and they do the same loop. So maybe put the result of that loop in a cached property? (Add type annotations as needed.)
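The cached-property idea might be sketched as below. `FlatDirectory` is a hypothetical, heavily simplified stand-in and not pip's `_FlatDirectorySource`; the `scans` counter exists only to make the single scan observable, and `functools.cached_property` assumes Python 3.8 or later.

```python
import os
from functools import cached_property
from pathlib import Path
from typing import List


class FlatDirectory:
    """Hypothetical, simplified stand-in; not pip's _FlatDirectorySource."""

    def __init__(self, path: str) -> None:
        self._path = path
        self.scans = 0  # illustration only: counts directory scans

    @cached_property
    def _urls(self) -> List[str]:
        # The directory scan and URL conversion run once per instance.
        self.scans += 1
        urls = [
            Path(os.path.abspath(entry.path)).as_uri()
            for entry in os.scandir(self._path)
        ]
        return sorted(urls)

    def file_links(self) -> List[str]:
        return list(self._urls)

    def page_candidates(self) -> List[str]:
        # Both callers share the same precomputed list.
        return [url for url in self._urls if url.endswith(".whl")]
```

Because the result is stored on the instance, it only helps when both methods are called on the same object.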
I will try that locally and, if successful, create a new PR, unless you want to, given you have already written some code for this?
No, go ahead. I haven't got the patience to work out the right type annotations :-)
FYI, this approach doesn't work immediately because each call to `page_candidates` and `file_links` is from a separate instance of `_FlatDirectorySource`. However, there are a number of possible solutions here; I will come up with one and submit a new PR.