Implement Lazy Loading of Submodules for faster import #3732

Closed
wants to merge 27 commits into from

Conversation

arjxn-py
Member

@arjxn-py arjxn-py commented Jan 16, 2024

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #3490

Type of change

Please add a line in the relevant section of CHANGELOG.md to document the change (include PR #) - note reverse order of PR #s. If necessary, also add to the list of breaking changes.

  • New feature (non-breaking change which adds functionality)
  • Optimization (back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist:

  • No style issues: $ pre-commit run (or $ nox -s pre-commit) (see CONTRIBUTING.md for how to set this up to run automatically when committing locally, in just two lines of code)
  • All tests pass: $ python run-tests.py --all (or $ nox -s tests)
  • The documentation builds: $ python run-tests.py --doctest (or $ nox -s doctests)

You can run integration tests, unit tests, and doctests together at once, using $ python run-tests.py --quick (or $ nox -s quick).

Further checks:

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

@arjxn-py arjxn-py added the infrastructure Packaging, distribution, and releases label Jan 16, 2024
@arjxn-py arjxn-py marked this pull request as draft January 16, 2024 10:27

codecov bot commented Jan 16, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (4484514) 99.59% compared to head (2b979af) 99.56%.

❗ Current head 2b979af differs from pull request most recent head ec571ba. Consider uploading reports for the commit ec571ba to get more accurate results

Files Patch % Lines
pybamm/util.py 72.22% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3732      +/-   ##
===========================================
- Coverage    99.59%   99.56%   -0.03%     
===========================================
  Files          258      258              
  Lines        20823    20839      +16     
===========================================
+ Hits         20738    20749      +11     
- Misses          85       90       +5     

@arjxn-py
Member Author

arjxn-py commented Jan 16, 2024

I tried handling wildcard (i.e. *) imports with this function but wasn't able to; each attribute to be imported has to be defined explicitly (mostly for expression_tree). I would be grateful for suggestions here.
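
A wildcard import needs the list of names up front, and getting that list by importing the submodule defeats the purpose of lazy loading. One way around this, sketched below, is to read the names from the source text with `ast` instead of executing it. The helper name `public_names` and the sample source are hypothetical and only illustrate the idea; this is not the code from this PR.

```python
import ast


def public_names(source: str) -> list[str]:
    """Return the top-level public names of a module without executing it."""
    tree = ast.parse(source)

    # Prefer an explicit __all__ when it is a plain list/tuple of string literals.
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__all__":
                    try:
                        return list(ast.literal_eval(node.value))
                    except ValueError:
                        pass  # __all__ is computed dynamically; fall back below

    # Otherwise collect classes, functions and simple assignments.
    names = []
    for node in tree.body:
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            names.append(node.name)
        elif isinstance(node, ast.Assign):
            names.extend(t.id for t in node.targets if isinstance(t, ast.Name))
    return [name for name in names if not name.startswith("_")]


# Hypothetical module source standing in for a file under expression_tree/
example = "class Symbol: ...\ndef simplify(x): return x\n_private = 1\nCONSTANT = 2\n"
print(public_names(example))
```

The resulting list could then feed whatever name-to-submodule mapping the lazy loader expects, so the explicit lists do not have to be maintained by hand.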

@agriyakhetarpal
Member

agriyakhetarpal commented Jan 16, 2024

I think the importlib.util or pkgutil modules from the Python standard library might have something that we could use. Ideally, we should define everything that should be available in the user namespace in a list named __all__ for every package and module. We haven't done that yet, but this PR could be a good start for doing so PyBaMM-wide.

This might or might not help to construct a function whose output can provide lazy_loader with importable paths for all submodules:

import pkgutil
import pybamm

print([i for i in pkgutil.walk_packages(pybamm.__path__)])

returns

paths to modules under pybamm/

[ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='batch_study', ispkg=False),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='callbacks', ispkg=False),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='citations', ispkg=False),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='discretisations', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='doc_utils', ispkg=False),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='experiment', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='expression_tree', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='geometry', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='input', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='install_odes', ispkg=False),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='logger', ispkg=False),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='meshes', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='models', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='parameters', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='parameters_cli', ispkg=False),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='plotting', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='settings', ispkg=False),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='simulation', ispkg=False),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='solvers', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='spatial_methods', ispkg=True),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='util', ispkg=False),
 ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm'), name='version', ispkg=False)]

You should be able to run this on packages under pybamm.* too. This is under pybamm.expression_tree as an example:

print([i for i in pkgutil.walk_packages(pybamm.expression_tree.__path__)])

returns

For the expression tree

ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='array', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='averages', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='binary_operators', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='broadcasts', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='concatenations', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='exceptions', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='functions', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='independent_variable', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='input_parameter', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='interpolant', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='matrix', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='operations', ispkg=True)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='parameter', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='printing', ispkg=True)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='scalar', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='state_vector', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='symbol', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='unary_operators', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='variable', ispkg=False)
ModuleInfo(module_finder=FileFinder('/Users/agriyakhetarpal/Desktop/PyBaMM/pybamm/expression_tree'), name='vector', ispkg=False)

and you can now manipulate this import machinery to describe your own module paths by joining the module_finder and name attributes for pybamm.util.lazy_loader! This might or might not work for our case (but at least TBH this is a great start for diving into all of the core details and the fundamentals of Pythonic imports, and I managed to learn something too!).

Note: there might be a better solution, of course; I am just reporting what I have found so far.
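
As a small sketch of the "joining" step described above: `pkgutil.walk_packages` accepts a `prefix` argument, so it can return fully qualified dotted names directly instead of bare names plus a finder path. The helper name `submodule_names` is hypothetical, and the standard-library `json` package is used as a stand-in so the snippet runs without PyBaMM installed.

```python
import importlib
import pkgutil


def submodule_names(package_name: str) -> list[str]:
    """List dotted names of all modules found under a package."""
    package = importlib.import_module(package_name)
    return [
        info.name
        for info in pkgutil.walk_packages(
            package.__path__, prefix=package.__name__ + "."
        )
    ]


# Stand-in package; with PyBaMM installed, "pybamm.expression_tree" could be passed instead.
print(submodule_names("json"))
```

Note that `walk_packages` imports each subpackage it finds in order to recurse into it, so this is better suited to generating the list ahead of time than to running inside `__init__.py` on every import.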

@agriyakhetarpal
Member

I was reading up on making imports faster and I think we don't need to essentially redefine the import pybamm mechanism by using lazy_loader everywhere; just making it faster would be great in itself. That is, we can isolate what is causing the slow import and target that module specifically for a lazy import, while keeping the other imports as they are. For example, something like

from .expression_tree.operations.jacobian import Jacobian

or

from .util import (
    get_parameters_filepath,
    have_jax,
    install_jax,
    have_optional_dependency,
    is_jax_compatible,
    get_git_commit_info,
)

will never be slow, but something like this

from .models.submodels import (
    active_material,
    convection,
    current_collector,
    electrolyte_conductivity,
    electrolyte_diffusion,
    electrode,
    external_circuit,
    interface,
    oxygen_diffusion,
    particle,
    porosity,
    thermal,
    transport_efficiency,
    particle_mechanics,
    equivalent_circuit_elements,
)

has too many things going on in one statement. I figure that it will be just the *-imports and those for the battery models like the above one that are currently causing the bottleneck, so it would be great if we can do some profiling in this PR to gauge what parts are slow – we can explicitly focus on those and leave the rest (and leave out the parts that cause unit and integration tests to fail – just a case of tedious trial and error).

It might be worth scouting SciPy to see how they manage it (despite being a large monorepo, import scipy takes just about a second at most). In the footnotes for SPEC-0001, I found this project that can help automate a few things related to *-imports: https://github.com/Erotemic/mkinit

N.B. For profiling to be effective, we will have to delete the __pycache__ directories and their files (compiled bytecode) everywhere since Python imports get cached very efficiently.
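
For the profiling step, CPython has a built-in `-X importtime` option that prints the self and cumulative time of every import to stderr. The sketch below runs it in a fresh interpreter and sorts the result. The function name `import_times` is hypothetical, and `json` is a stand-in for `pybamm` so the snippet runs anywhere.

```python
import subprocess
import sys


def import_times(module: str) -> list[tuple[int, str]]:
    """Return (cumulative_microseconds, imported_name) pairs, slowest first."""
    result = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    rows = []
    for line in result.stderr.splitlines():
        if not line.startswith("import time:"):
            continue
        # Each row looks like: "import time: <self> | <cumulative> | <name>"
        parts = line[len("import time:"):].split("|")
        if len(parts) != 3:
            continue
        try:
            cumulative = int(parts[1])
        except ValueError:
            continue  # header row, not a measurement
        rows.append((cumulative, parts[2].strip()))
    return sorted(rows, reverse=True)


# Show the ten most expensive imports triggered by the stand-in module.
for microseconds, name in import_times("json")[:10]:
    print(f"{microseconds:>10} us  {name}")
```

Because every call starts a new process, nothing is left over in `sys.modules` from a previous run; only the on-disk bytecode cache can still affect the numbers.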

@arjxn-py
Member Author

Thanks a lot @agriyakhetarpal, I'm looking into the segregation first, as you suggested, and will get back in some time.

@agriyakhetarpal
Member

I might have solved it: for modules that further have sub-modules (pybamm.models.*), we might need to modify every __init__.py file, not just the base one. I tried this command in the root directory which set everything up for me automatically:

mkinit --lazy_loader --inplace --recursive pybamm  # also accepts --noall so as to not add __all__ in the files

and I presumably removed the import caches (all __pycache__ folders) recursively by doing

find . -type d -name '__pycache__' -prune -exec rm -rf {} +  # also handles paths containing spaces

But as I understand, this still does not remove all caches – so %timeit importlib.import_module("pybamm") in an interactive REPL might be returning an incorrect calculation and could be unreliable. It does feel faster, though – but it could be a case of placebo.

Outputs from importing PyBaMM

Without lazy_loader

%timeit importlib.import_module("pybamm")

The slowest run took 13.77 times longer than the fastest. This could mean that an intermediate result is being cached.
2.33 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

With lazy_loader

%timeit importlib.import_module("pybamm")

The slowest run took 11.64 times longer than the fastest. This could mean that an intermediate result is being cached.
2.29 µs ± 2.84 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

This isn't helpful TBH, so I started deleting the entries from the sys.modules dictionary:

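import sys

# Drop pybamm and its submodules from the in-process import cache.
# Matching on the exact name or the "pybamm." prefix avoids touching
# unrelated modules that merely contain "pybamm" in their name.
keys_to_delete = [key for key in sys.modules if key == "pybamm" or key.startswith("pybamm.")]
for key in keys_to_delete:
    del sys.modules[key]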

in addition to removing the __pycache__ directories, but I haven't found a substantial speed-up yet. Maybe this can act as a precursor for your experiments as we go further. A reliable method, though a bit tedious, would be to re-clone the repository, re-install it from source, apply the lazy_loader changes, and then benchmark the import pybamm statement – maybe try that and see what you get?
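
A likely reason for the microsecond figures above: once a module has been imported, `importlib.import_module` simply returns the existing entry from `sys.modules`, so every `%timeit` loop after the first measures a dictionary lookup and not the import itself. The `__pycache__` directories only hold compiled bytecode and do not affect that in-process cache. A minimal sketch of a measurement that sidesteps this by starting a new interpreter for each sample is given below; the function names are hypothetical and `json` stands in for `pybamm`.

```python
import statistics
import subprocess
import sys
import time


def time_in_fresh_interpreter(code: str, repeats: int = 5) -> float:
    """Median wall-clock seconds needed to run `code` in a new Python process."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", code], check=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def import_cost(module: str, repeats: int = 5) -> float:
    """Approximate import time: total run time minus bare interpreter startup."""
    baseline = time_in_fresh_interpreter("pass", repeats)
    return time_in_fresh_interpreter(f"import {module}", repeats) - baseline


print(f"import json costs roughly {import_cost('json'):.3f} s")
```

Running this once on the development branch and once with the lazy-loading changes installed would give two numbers that can be compared directly, without re-cloning the repository each time.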

P.S. You can consider putting out a testimonial (scientific-python/lazy-loader#50) after we manage to do this

@arjxn-py
Member Author

arjxn-py commented Jan 21, 2024

This does the job; thanks for the suggestion, @agriyakhetarpal.
The import is much faster for me now than before 🎉
I tested it as best I could by repeatedly re-cloning PyBaMM and in new environments with Python 3.9, 3.10, 3.11 and 3.12. The results below are from 3.12:

Without lazy-import :

The slowest run took 7.07 times longer than the fastest. This could mean that an intermediate result is being cached.
58 µs ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

With lazy-import :

The slowest run took 14.75 times longer than the fastest. This could mean that an intermediate result is being cached.
2.74 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, this big change has caused a number of CI failures, which will be fixed iteratively.

@arjxn-py
Member Author

But as I understand, this still does not remove all caches – so %timeit importlib.import_module("pybamm") in an interactive REPL might be returning an incorrect calculation and could be unreliable. It does feel faster, though – but it could be a case of placebo.

I am also seeing %timeit importlib.import_module("pybamm") return an incorrect measurement (possibly due to caching); however, I can clearly notice a difference between importing PyBaMM with and without lazy import.

@arjxn-py
Member Author

arjxn-py commented Jan 21, 2024

Now that I'm trying to fix the CI failures, I first tried to resolve the imports in the __init__.py files, but I am realizing that this leads to defining the imports recursively again, as before, which is not ideal.

The other way around is to fix the imports in the code, i.e.

  • pybamm.Symbol should be pybamm.expression_tree.Symbol
  • pybamm.Parameter should be pybamm.expression_tree.parameter.Parameter
  • pybamm.multiply should be pybamm.expression_tree.binary_operators.multiply
  • & so on

But this approach leads to a big API change. So I'd be more than happy to have suggestions here, and I would also like to know if I'm missing anything or could try something else instead.
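
For context on keeping the flat names such as `pybamm.Symbol`: since Python 3.7 (PEP 562), a module can define a module-level `__getattr__` that is called only when normal attribute lookup fails. A package `__init__.py` can use it to map a flat name to the submodule that defines it and import that submodule on first access. The sketch below shows the mechanism with hypothetical names (`make_lazy_getattr`, `fake_pkg`) and standard-library stand-ins for the real submodules. As far as I understand, lazy_loader builds on this same hook, but the code here is not its implementation.

```python
import importlib
import types


def make_lazy_getattr(attr_to_module: dict[str, str]):
    """Build a module-level __getattr__ that imports submodules on first use."""

    def __getattr__(name: str):
        try:
            module_name = attr_to_module[name]
        except KeyError:
            raise AttributeError(f"no lazy attribute {name!r}") from None
        # The defining module is imported only now, on first access.
        return getattr(importlib.import_module(module_name), name)

    return __getattr__


# Throwaway module standing in for a package __init__.py. In a real package
# one would write `__getattr__ = make_lazy_getattr({...})` at module level,
# with entries such as "Symbol" mapped to the submodule that defines it.
fake_pkg = types.ModuleType("fake_pkg")
fake_pkg.__getattr__ = make_lazy_getattr({"sqrt": "math", "dumps": "json"})

print(fake_pkg.sqrt(9.0))
print(fake_pkg.dumps({"lazy": True}))
```

With a mapping like this, user code keeps calling the flat name and no public API change is needed; the cost is that the name-to-submodule table has to be kept in sync with the code.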

@valentinsulzer
Member

Definitely don't make that API change. Also having to keep the __init__.py files up to date like that will make development more challenging. What are the benefits of lazy loading that would justify this?

@agriyakhetarpal
Member

Usually import pybamm can take you 15 to 20 seconds in a Jupyter notebook (mostly the case for me) – lazy loading is supposed to just import PyBaMM and then import the modules under it at runtime dynamically when they are first used in a script.

I don't think we are using lazy_loader correctly, since it was defined for scenarios like this (to not bring a breaking change in terms of the public API). Can we skip lazy loading pybamm.Symbol and others (whatever ones break the tests)?
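
To illustrate what "import at runtime when first used" means in practice, the standard library ships `importlib.util.LazyLoader`, which postpones executing a module's body until one of its attributes is first accessed. The sketch below follows the recipe in the `importlib` documentation. The function name `lazy_import` and the stand-in module `colorsys` are only for illustration, and this is a different mechanism from the lazy_loader package discussed in this thread.

```python
import importlib.util
import sys


def lazy_import(name: str):
    """Return a module whose body is executed on first attribute access."""
    if name in sys.modules:
        return sys.modules[name]  # already imported, nothing to defer
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ModuleNotFoundError(f"No module named {name!r}")
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up the deferral; the module body has not run yet
    return module


colorsys = lazy_import("colorsys")  # cheap: nothing has been executed so far
print(colorsys.rgb_to_hsv(1.0, 0.0, 0.0))  # first attribute access triggers the real import
```

One consequence of deferring execution is that an import error in the module surfaces at first use and not at the `import` line, which can make failures harder to trace.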

@arjxn-py
Member Author

arjxn-py commented Jan 21, 2024

Without lazy-import :

The slowest run took 7.07 times longer than the fastest. This could mean that an intermediate result is being cached.
58 µs ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

With lazy-import :

The slowest run took 14.75 times longer than the fastest. This could mean that an intermediate result is being cached.
2.74 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

One known benefit of lazy loading is reduced import time. It is also expected to improve performance, since many of the attributes that are loaded at import come from an expensive operation, e.g. from .expression_tree.symbol import *

@arjxn-py
Member Author

Can we skip lazy loading pybamm.Symbol and others (whatever ones break the tests)?

I've tried doing that, but it led to a chain of further imports that also had to skip lazy loading. But yes, we can try this to reach a point where there are no (or minimal) failures and the lowest import time.

@arjxn-py
Member Author

Closing this as discussed in the last developer meeting.

Labels
infrastructure Packaging, distribution, and releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable lazy loading of submodules
4 participants