Cache call to path_to_url #12322

Closed
wants to merge 6 commits into from
1 change: 1 addition & 0 deletions news/12322.bugfix.rst
@@ -0,0 +1 @@
+Improve performance ~17% when installing many wheels offline
12 changes: 11 additions & 1 deletion src/pip/_internal/utils/urls.py
@@ -2,6 +2,7 @@
 import string
 import urllib.parse
 import urllib.request
+from functools import lru_cache
 from typing import Optional

 from .compat import WINDOWS
@@ -13,13 +14,22 @@ def get_url_scheme(url: str) -> Optional[str]:
     return url.split(":", 1)[0].lower()


+@lru_cache(maxsize=None)
Member

@pradyunsg pradyunsg Oct 6, 2023

This is unbounded and would store information that is not used more than once.

Member Author

Yes, but what other solution is there?

We can't know ahead of time how many file paths need caching.

If a maxsize is given, it is completely arbitrary. If you think it's required for memory safety, I would prefer a very large number that is not expected to be reached, like 10,000.
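
To make the trade-off under discussion concrete, here is a minimal sketch comparing an unbounded cache with a bounded one. The function `to_url`, the wheel file names, and the limit of 10,000 are illustrative stand-ins, not pip's actual code.

```python
import urllib.parse
import urllib.request
from functools import lru_cache


def to_url(abs_path: str) -> str:
    # Stand-in for the cached helper in the diff; assumes the path is
    # already absolute and normalized.
    return urllib.parse.urljoin("file:", urllib.request.pathname2url(abs_path))


# Unbounded: every distinct path stays in memory for the life of the process.
unbounded = lru_cache(maxsize=None)(to_url)

# Bounded: the least recently used entries are evicted once the limit is hit.
bounded = lru_cache(maxsize=10_000)(to_url)

# Hypothetical wheel paths, twice as many as the bounded cache can hold.
paths = [f"/wheels/pkg{i}-1.0-py3-none-any.whl" for i in range(20_000)]
for p in paths:
    unbounded(p)
    bounded(p)

print(unbounded.cache_info())  # currsize grows with the number of distinct paths
print(bounded.cache_info())    # currsize is capped at maxsize
```

With a bounded cache, lookups for evicted paths fall back to recomputing the URL, so the cost of a too-small limit is lost speed rather than incorrect results.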

Member

> Yes, but what other solution is there?

As you said above

> An alternative solution would be to rearchitect Pip to not need to call path_to_url so much

Agreed, it's complex, but that may be better than throwing memory at the problem - after all, pip does get used in memory-constrained environments. I don't know what lru_cache does when it's getting close to memory limits, but I doubt it tries to manage that situation particularly - so you'd probably at some point start to get paging and a significant reduction in performance.

> We can't know ahead of time how many file paths need caching.

No, but nothing here needs to be cached. A bounded cache simply means getting some of the performance benefits rather than all of them.

Ultimately, we need to balance different use cases here. I'd consider installing 1000 wheels in a single install to be a very extreme case, and honestly I don't think 2.5 minutes is a particularly bad time for that. So while I'm always glad if we can get performance improvements, I think we have to be careful to keep perspective here. The overwhelming majority of the uses of pip "in the wild" are likely to be of the form pip install <one or two packages>.

Member

@pfmoore pfmoore Oct 6, 2023

Alternatively, maybe we can just speed up path_to_url? Looking at the following:

❯ pyperf timeit -s "from pathlib import Path; import os" "Path(os.path.normpath(os.path.abspath('.'))).as_uri()"
.....................
Mean +- std dev: 5.67 us +- 0.13 us
❯ pyperf timeit -s "from urllib.parse import urljoin; from urllib.request import pathname2url; import os" "urljoin('file:', pathname2url(os.path.normpath(os.path.abspath('.'))))"
.....................
Mean +- std dev: 9.23 us +- 0.34 us

suggests that using Path.as_uri() is a lot faster. Using Path.absolute() rather than os.path loses a lot of the gain; I'm not sure why, but maybe it's because it calls the Path constructor twice.

The point is, there may well be other options than caching the results. Or there may be improvements that can be achieved as well as a (limited-size) cache. As with any performance exercise, it's all about trade-offs.

Edit: The same tests on Ubuntu (WSL) don't give the same improvements for the pathlib approach. Make of that what you will.
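
For anyone who wants to repeat this comparison without installing pyperf, the sketch below uses the standard library's `timeit` instead. It is a rougher measurement than pyperf, the function names `via_pathlib` and `via_urllib` are made up for this example, and, as noted above, the results will differ between platforms.

```python
import os
import timeit
import urllib.parse
import urllib.request
from pathlib import Path


def via_pathlib(path: str) -> str:
    # The pathlib-based variant from the first pyperf command.
    return Path(os.path.normpath(os.path.abspath(path))).as_uri()


def via_urllib(path: str) -> str:
    # The urllib-based variant, matching what path_to_url does today.
    abs_path = os.path.normpath(os.path.abspath(path))
    return urllib.parse.urljoin("file:", urllib.request.pathname2url(abs_path))


CALLS = 10_000
for fn in (via_pathlib, via_urllib):
    seconds = timeit.timeit(lambda: fn("."), number=CALLS)
    print(f"{fn.__name__}: {seconds / CALLS * 1e6:.2f} us per call")
```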

Member Author

@notatallshaw notatallshaw Oct 6, 2023

> I'd consider installing 1000 wheels in a single install to be a very extreme case, and honestly I don't think 2.5 minutes is a particularly bad time for that.

Home Assistant is a very popular application in the smart home world, so it's a relatively common use case.

And it's 2.5 minutes on my machine; it's 1-2 hours on other people's: #12314. This was going to be my first in a series of PRs.

And the cache here only grows if there are a lot of wheels, so there's only a few kilobytes of memory used in non-"extreme" examples.

> Agreed, it's complex, but that may be better than throwing memory at the problem - after all, pip does get used in memory-constrained environments.

If a user needs to install over 1000 wheels using Pip, I have to assume they have 1 or 2 MB of spare memory.
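
The memory cost can be estimated rather than assumed. The following is a rough sketch using `tracemalloc`; the function `cached_url` and the wheel paths are hypothetical, and the measured growth depends on path lengths, Python version, and platform.

```python
import tracemalloc
import urllib.parse
import urllib.request
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_url(abs_path: str) -> str:
    # Stand-in for the cached helper in this PR.
    return urllib.parse.urljoin("file:", urllib.request.pathname2url(abs_path))


# Hypothetical wheel paths; real names and lengths will differ.
paths = [f"/srv/wheels/package_{i}-1.0.0-py3-none-any.whl" for i in range(1_000)]

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
for p in paths:
    cached_url(p)
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"cache entries: {cached_url.cache_info().currsize}")
print(f"approximate growth: {(after - before) / 1024:.1f} KiB")
```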

I'll take another look at the rearchitecture approach, but it probably means a significant rework of the way pip and resolvelib interact with each other. My worry is that even if I were able to make a PR, there's a good chance it would never be accepted, since it would come from a non-maintainer and add a significant amount of knowledge required for maintenance. Or it would be rejected by resolvelib for breaking other downstream consumers.

Member Author

@notatallshaw notatallshaw Oct 7, 2023

> (If you have spent this time, let us know -- I might've missed information around this!)

Yes, I've been profiling this: #12314 (comment) (path_to_url is the leftmost light greenish-blue box).

There are lots of other hot spots in this profile graph, but I just thought I'd start with the simplest-looking offender.

Member

> Yes, I've been profiling this

The two big callers are file_links() and page_candidates in _FlatDirectorySource, and they do the same loop. So maybe put the result of that loop in a cached property?

class _FlatDirectorySource(LinkSource):
    def __init__(
        self,
        candidates_from_page: CandidatesFromPage,
        path: str,
    ) -> None:
        self._candidates_from_page = candidates_from_page
        self._path = pathlib.Path(os.path.realpath(path))
        self._file_urls = None

    @property
    def link(self) -> Optional[Link]:
        return None

    def _scan_dir(self):
        if self._file_urls is None:
            _file_urls = []
            for path in self._path.iterdir():
                url = path_to_url(str(path))
                _file_urls.append((url, _is_html_file(url)))
            self._file_urls = _file_urls

    def page_candidates(self) -> FoundCandidates:
        self._scan_dir()
        for url, html in self._file_urls:
            if html:
                yield from self._candidates_from_page(Link(url))

    def file_links(self) -> FoundLinks:
        self._scan_dir()
        return (Link(url) for url, html in self._file_urls if not html)

(Add type annotations as needed).

Member Author

I will try that locally and, if successful, create a new PR, unless you want to, given you have already written some code for this?

Member

No, go ahead. I haven't got the patience to work out the right type annotations :-)

Member Author

FYI, this approach doesn't work immediately because each call to page_candidates and file_links comes from a separate instance of _FlatDirectorySource. However, there are a number of possible solutions here; I will come up with one and submit a new PR.
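
One possible way around the separate-instances problem is to move the directory scan into a module-level function that is cached by directory path, so every instance pointing at the same directory shares one scan. This is only a sketch under stated assumptions, not the solution that was eventually submitted: `scan_directory` and `FlatDirectorySource` are simplified, hypothetical stand-ins, and the `.html` suffix check replaces pip's `_is_html_file` purely for illustration.

```python
import os
import tempfile
import urllib.parse
import urllib.request
from functools import lru_cache
from pathlib import Path
from typing import List, Tuple


def path_to_url(path: str) -> str:
    # Same logic as path_to_url in this diff, without the cache.
    path = os.path.normpath(os.path.abspath(path))
    return urllib.parse.urljoin("file:", urllib.request.pathname2url(path))


@lru_cache(maxsize=None)
def scan_directory(directory: str) -> Tuple[Tuple[str, bool], ...]:
    """Return (url, is_html) pairs; computed once per distinct directory."""
    entries: List[Tuple[str, bool]] = []
    for entry in sorted(Path(directory).iterdir()):
        url = path_to_url(str(entry))
        # Assumption: a suffix check stands in for pip's _is_html_file.
        entries.append((url, url.lower().endswith((".html", ".htm"))))
    return tuple(entries)


class FlatDirectorySource:
    """Simplified stand-in for _FlatDirectorySource."""

    def __init__(self, path: str) -> None:
        self._path = os.path.realpath(path)

    def file_urls(self) -> List[str]:
        return [url for url, html in scan_directory(self._path) if not html]

    def html_urls(self) -> List[str]:
        return [url for url, html in scan_directory(self._path) if html]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a-1.0-py3-none-any.whl", "index.html"):
        Path(tmp, name).touch()
    # Two separate instances, as happens in the real call pattern.
    FlatDirectorySource(tmp).file_urls()
    FlatDirectorySource(tmp).html_urls()
    print(scan_directory.cache_info())  # the second call is a cache hit
```

The trade-off is that a cached scan never notices files added to the directory later in the same process, which may or may not matter for pip's use.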

+def _normalized_abs_path_to_url(abs_path: str) -> str:
+    """
+    Convert a normalized absolute path to a file: URL.
+    """
+    url = urllib.parse.urljoin("file:", urllib.request.pathname2url(abs_path))
+    return url
+
+
 def path_to_url(path: str) -> str:
     """
     Convert a path to a file: URL. The path will be made absolute and have
     quoted path parts.
     """
     path = os.path.normpath(os.path.abspath(path))
-    url = urllib.parse.urljoin("file:", urllib.request.pathname2url(path))
+    url = _normalized_abs_path_to_url(path)
     return url

