This repository contains bazel rules for assembling a conda environment.
These rules:

- Generate a lock file based on a package requirements list.
- Build a conda environment from that lock file.
- Support depending on just the subset of that environment needed by your build target.
- Register the `python` in that environment as a bazel python toolchain.
- Support using `conda` packages as C/C++ dependencies.
Add the following to your `WORKSPACE`:

```starlark
load(
    "@bazel_tools//tools/build_defs/repo:git.bzl",
    "git_repository",
)

git_repository(
    name = "com_github_10XGenomics_rules_conda",
    remote = "https://github.com/10XGenomics/rules_conda.git",
    # Pin a specific commit or tag here for reproducible builds.
)

load(
    "@com_github_10XGenomics_rules_conda//:deps.bzl",
    "rules_conda_dependencies",
)

rules_conda_dependencies()

load(
    "@tenx_bazel_rules//:deps2.bzl",
    "second_level_dependencies",
)

# Transitive dependencies and repository helper.
second_level_dependencies()

load("@io_bazel_rules_go//go:deps.bzl", "go_register_toolchains")

# Go is required for the repository helper tools.
go_register_toolchains(version = "1.23.1")
```
Next, in whatever package you wish
(conventionally `third-party/conda`),
add a file listing requirements in a form that `micromamba create` will
understand, e.g.

```
coverage>=6.3
cython>=0.29
numpy
python>=3.8.8
```
and a rule:

```starlark
load("@com_github_10XGenomics_rules_conda//rules:conda_package_lock.bzl", "conda_package_lock")

conda_package_lock(
    name = "generate_package_lock",
    channels = [
        "conda-forge",
    ],
    requirements = "conda_requirements.txt",
    target = "conda_env.bzl",
    visibility = ["//visibility:public"],
)
```
See the documentation for that rule for additional options you might want to set for it.
`bazel run` that target to generate the `conda_env.bzl` file, and then add

```starlark
load("//third-party/conda:conda_env.bzl", "conda_environment")

conda_environment()
```

to your `WORKSPACE`.
You may now build `@conda_env//:conda_env`
(a convenience target for building all packages in the environment;
usually you don't want other bazel targets to depend on this).
It may not work on the first try, as it is common for conda metadata to have
errors which need to be manually corrected (see below).
The `conda_environment` workspace macro will automatically register the conda
`python` package as the python toolchain for bazel.
Python targets can depend on packages from that environment as e.g.

```starlark
py_library(
    name = "uses_numpy",
    srcs = ["uses_numpy.py"],
    deps = [
        "@conda_env//:numpy",
    ],
)
```
This will automatically pull in transitive dependencies as well.
You can also depend on packages from a `cc_library`,
if the package contains C/C++ headers or libraries.
This will not pull in transitive compile-time or link-time dependencies,
however.
Conda packages do not make a distinction between a dependency
that is used at runtime, e.g. as a subprocess,
versus one that supplies headers or libraries necessary at compile/link time.
You probably don't want to be linking against `libpython` just because you
needed to link against a package that contains a few python scripts.
And, if you do, adding it explicitly is easier than "un-adding" it in the cases
where you don't.
Transitive dependencies will still be propagated as if they were `data`
dependencies, however.
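For example, a `cc_library` might depend on a conda package that ships headers
and libraries (a sketch; the `zlib` package name is illustrative and assumes
that package is present in your environment):

```starlark
cc_library(
    name = "uses_zlib",
    srcs = ["uses_zlib.cc"],
    deps = [
        # Direct dependency on the conda package. Remember that transitive
        # compile/link dependencies are not pulled in automatically.
        "@conda_env//:zlib",
    ],
)
```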
If you want to be able to use a conda package as an executable target
(e.g. with `bazel run` or in the `tools` of a `genrule`),
add it to the `executable_packages`
of the `conda_environment_repository` rule.
In addition to making the package target executable, this will also cause two
additional targets to be created:

- `@conda_env//:<package_name>_exe_file`, which has only the executable file
  as an output (so you can use it in a `$(location)` expansion in a `genrule`,
  for example)
- `@conda_env//:<package_name>_exe`, which adds back the rest of the files as
  runfiles.
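For instance, a `genrule` could invoke such a tool via its `_exe_file` target
(a sketch; the `jq` package name is hypothetical and assumes `jq` was listed
in `executable_packages`):

```starlark
genrule(
    name = "extract_version",
    srcs = ["package.json"],
    outs = ["version.txt"],
    # $(location) expands to the path of the bare executable file.
    cmd = "$(location @conda_env//:jq_exe_file) -r .version $(location package.json) > $@",
    tools = ["@conda_env//:jq_exe_file"],
)
```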
It is common for at least one package to have issues.
The most common problem is an invalid or unrecognized license declaration.
That can be overridden by adding e.g.

```starlark
licenses = ["@rules_license//licenses/spdx:LGPL-2.1"],
```

to the appropriate `conda_package_repository` declaration in your
`conda_env.bzl`.
> **Note:** These bazel rules make a best-effort attempt to parse the license
> metadata in the conda package; however, they may misinterpret that metadata,
> or the metadata could be simply incorrect. These rules do not offer legal
> advice.
Package metadata will frequently include an incorrect path to the license file,
or simply not specify a path at all[^1].
In this situation, you can override the metadata by using the
`license_file` attribute.
The most common failure mode of this type is for the license file to actually
be at `info/licenses/LICENSE0.txt`.
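In that case, the override might look like this (a sketch following the same
pattern as the `licenses` override above):

```starlark
license_file = "info/licenses/LICENSE0.txt",
```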
Sometimes two packages will declare the same file.
One very common example is packages carelessly including
`lib/python*/__pycache__/_sysconfigdata*.pyc`.
You can add glob patterns like that for files or directories to the
`exclude` attribute on a `conda_package_repository`,
in which case those files will be skipped.
If you wish, you can also use `exclude` to get rid of unnecessary content like
man pages or test suites, which might otherwise unnecessarily bloat your build.
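For example, an `exclude` attribute on a `conda_package_repository` might look
like this (the man-page pattern, and its glob syntax, are illustrative
additions):

```starlark
exclude = [
    # Skip carelessly packaged bytecode that collides with other packages.
    "lib/python*/__pycache__/_sysconfigdata*.pyc",
    # Skip content that would otherwise bloat the build.
    "share/man/**",
],
```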
The `conda_package_repository` supports the same attributes as `http_archive`
for patching.
Sometimes you may wish to use a conda package as a build dependency for a
`cc_library` target.
The rules will make a best-effort attempt to guess the include path,
but in some cases you may need to override it, e.g.

```starlark
cc_include_path = ["include/eigen3/Eigen"]
```
Some packages will declare dependencies they don't actually need
(at least for your use case),
or may have undeclared optional dependencies which they do need for your
use case.
The `exclude_deps` and `extra_deps` attributes can be used to
cut or add dependency edges.
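For example, on a `conda_package_repository` (the package names here are
hypothetical):

```starlark
# Cut a declared dependency edge we don't need, and add an undeclared one.
exclude_deps = ["unused-declared-dep"],
extra_deps = ["needed-optional-dep"],
```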
The `conda_package_lock` rule also has an `exclude` attribute
which will cause the tool to ignore packages from the results of the solve.
It will automatically add `exclude_deps` attributes for any packages which
depend on one of those.
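For example, extending the lock rule from above (the excluded package name is
hypothetical):

```starlark
conda_package_lock(
    name = "generate_package_lock",
    channels = ["conda-forge"],
    requirements = "conda_requirements.txt",
    target = "conda_env.bzl",
    # Drop this package from the solve; exclude_deps attributes will be
    # added automatically for anything that depended on it.
    exclude = ["unwanted-package"],
)
```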
Sometimes the name of a conda package doesn't match up with the name of a
python import or corresponding `pip` package
(e.g. `scikit-learn` vs. `sklearn`).
For convenience, the `conda_environment_repository` can take a dictionary
of aliases to include in the generated `BUILD` file.
It will automatically create aliases with `-` replaced by `_`,
as this is a common transformation.
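A target can then depend on the aliased name (a sketch, assuming
`scikit-learn` is in your environment and an `sklearn` alias has been
defined):

```starlark
py_library(
    name = "uses_sklearn",
    srcs = ["uses_sklearn.py"],
    deps = [
        # Alias resolving to the scikit-learn conda package.
        "@conda_env//:sklearn",
    ],
)
```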
This automatic aliasing may cause problems in cases
where the upstream conda packaging attempted to solve this problem by creating
e.g. both `importlib-metadata` and `importlib_metadata`.
Generally one of those is an empty package that just depends on the other.
If that happens, `exclude` the empty package in the `conda_package_lock`
rule and add an alias manually.
## Is this ready for production use?

This is a fork of an internal version[^2] of the rules which we've been using
in production for years. They've worked very well for us, and we don't foresee
making major changes any time soon. So, yes (though this should not be
construed as contradicting the license terms, which make clear that we provide
no warranty).
## Does coverage work?

Yes! If you include the `coverage` package in your environment,
it'll automatically be added to the generated python toolchain definition.
You can try this out in this repo: `bazel coverage //...` will produce
`bazel-out/_coverage/_coverage_report.dat` with coverage of the
go, python, and C++ targets.
## Do you support bzlmod?

On the roadmap, not there yet.
`bzlmod` version resolution rules guarantee that you'll always get the same
package resolutions for the same `MODULE.bazel`
(with caveats around local configuration).
The conda package solver makes no such guarantee, so a lock file is essential
for build reproducibility.
The solver can also be quite slow, on the order of minutes for more complex
environments. The lock file isn't difficult to manage; we have a GitHub action
that periodically runs the solver and opens a PR to update the lock file.
## How does this interact with rules_python?

In many ways this is complementary with `rules_python`.
It is probably not compatible with `pip_install`, however.
`pip`, and therefore `pip_install` by extension,
is somewhat inherently problematic for reproducible builds,
and it won't integrate well with the generated conda environment.
Also, the conda ecosystem is not exclusively about python. We have internal
repositories which don't use python at all, but which use these rules to
obtain C/C++ dependencies.
## What if I need a package that isn't available from conda?

If you need a package that can't be found in `conda`,
you can get it some other way (e.g. from a `.whl` or by building from source)
and inject it into the generated conda environment using e.g.
`new_conda_package_http_repository`.
Add it to `extra_packages` in the `conda_package_lock` target to
make the generated `conda_environment_repository` include it.
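The addition to the lock rule might look like this (the package name is
hypothetical):

```starlark
extra_packages = ["my_custom_package"],
```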
## Why a separate repository for each package?

By generating separate repositories for each package, you get in some cases
much faster build times, because fetching of a package can go on in parallel
with analysis of targets which don't depend transitively on that package.
Really, the `conda_package_repository` calls should be thought of as private
implementation details.
Most of the time you only need to interact with `@conda_env`.
## Do you support platforms other than linux/amd64?

On the roadmap.
There's nothing fundamental preventing it,
but it hasn't been a priority for us to work on it.
Parts of it already work on other platforms, including
`conda_package_lock`,
so you can update the `conda_environment.bzl`
from your aarch64 MacBook,
but at the moment only if your target platform is still linux on amd64,
due to a few hard-coded assumptions here and there.
Unfortunately, cross compilation doesn't really work at the moment due to
(ironically) the need to support `noarch` packages.
We need to be able to execute the python executable in order to ask it for
the correct path for `lib/pythonX.Y` to prepend for those packages.
We could avoid this if we were willing to just assume that the answer is
always `lib/pythonX.Y`, where `X.Y` is the major/minor version number,
but at least so far our need for supporting cross-compilation has been
insufficient to justify making such an assumption.
## Why go?

First of all, conda is not exclusively a Python ecosystem.
Some aspects of putting together an environment are simply too complicated
to do purely in starlark.
`go` has several desirable properties for performing those steps:

- Go is more portable than python. No, really. For repository rules, you can't use a build target, so you're basically stuck using the system python, which could be anything, or even nothing. Once you've got your nice python environment set up using these rules, great! But bootstrapping still needs to happen.
- Go is very fast to compile. At least once you have go's module cache populated, it can compile the tools necessary for building the repo in a fraction of a second.
- Go is pretty fast at runtime, too. It doesn't prioritize runtime performance the way e.g. rust does, but it's still faster than pretty much any scripting language. Even if you don't have hundreds of packages in your conda environment, the time it takes to compile the go-based repository helper executable used to generate the `BUILD` files is going to be far less than the time it saves by running a compiled executable rather than an interpreted script.
- The canonical tool for formatting `BUILD` files, `buildifier`, is written in go. The libraries backing it can be used to generate build files programmatically.
- The bazel rules for go are very mature. The SDK repository rule knows how to use the host go, or download the SDK from the internet, and then, critically, it exposes that SDK in a way that can be used from another repository rule.
- Subjective reasons. Personal preferences of the rule authors.
## Do I have to use go?

In our internal monorepos which use this rule set, we're using go for other
reasons anyway, so it's a non-issue. If you aren't otherwise using go, then
needing to make a call to configure the go toolchain is understandably a
little annoying. We don't expect the extra two lines of boilerplate in your
`WORKSPACE` to be a deal-breaker for anyone.
[^1]: In theory this would be fine, but there is a bug in rules_license which
    means that the license rule does not work properly if a license file is
    not provided.

[^2]: The main differences between this public version of the rules and the
    internal version are due to divergence between our internal fork of
    `rules_license` and the public one. We're hoping we can eventually
    upstream enough changes to `rules_license` to allow us to eliminate those
    differences.