A simple interface to allow electrostatic embedding of machine learning potentials using an ORCA-like interface. Based on code by Kirill Zinovjev. An example sander implementation is provided. This works by reusing the existing interface between sander and ORCA, meaning that no modifications to sander are needed. An OpenMM interface is also available via Sire. The embedding model currently supports the HCNOS elements. We plan to add support for further elements in the near future.
Further details can be found in our paper, available here. Please
cite this work if you use emle-engine
in your research. Supplementary
information and data can be found here. For
the original theory behind EMLE, please refer to this
publication.
We thank EPSRC for funding (grant code EP/V011421/1).
First create a conda environment with all of the required dependencies:
conda env create -f environment.yaml
conda activate emle
If this fails, try using mamba as a replacement for conda.
For GPU functionality, you will need to install appropriate CUDA drivers on
your host system along with NVCC, the CUDA compiler driver. (This doesn't come
with cudatoolkit from conda-forge.)
(Depending on your CUDA setup, you might need to prefix the environment creation
command above with something like CONDA_OVERRIDE_CUDA="11.2"
to resolve an
environment that is compatible with your CUDA driver.)
Finally, install emle-engine:
pip install .
If you are developing and want an editable install, use:
pip install -e .
To start an EMLE calculation server:
emle-server
For usage information, run:
emle-server --help
(By default, an emle_settings.yaml
file will be written to the working directory
containing the settings used to configure the server. This can be used to re-run
an existing simulation using the --config
option or EMLE_CONFIG
environment
variable. Additional emle_pid.txt
and emle_port.txt
files contain the process
ID and port of the server.)
To launch a client to send a job to the server:
orca orca_input
Where orca_input
is the path to a fully specified ORCA input file. In the
examples given here, the orca
executable will be called by sander
when
performing QM/MM, i.e. we are using a fake ORCA as the QM backend.
(Alternatively, just running orca orca_input
will try to connect to an existing
server and start one for you if a connection is not found.)
The server and client should both connect to the same hostname and port. This
can be specified in a script using the environment variables EMLE_HOST
and
EMLE_PORT
. If not specified, then the same default values will be used for
both the client and server.
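If you are orchestrating the server and client from Python rather than a shell script, the same variables can be set programmatically before either is launched. The host name and port below are arbitrary examples, not defaults:
import os

# Both the server and the client read these variables, so setting them
# once in the driving script keeps the two in sync.
os.environ["EMLE_HOST"] = "localhost"
os.environ["EMLE_PORT"] = "10000"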
To stop the server:
emle-stop
The EMLE
model uses Atomic Environment Vectors (AEVs) for the calculation of
the electrostatic embedding energy. For performance, it is desirable to use
the optimised symmetry functions provided by the NNPOps
package. This requires a static compute graph, so the atomic numbers of the atoms
in the QM region must be known in advance. These can be specified using
the EMLE_ATOMIC_NUMBERS
environment variable, or the --atomic-numbers
command-line argument when launching the server. This option should only be
used if the QM region is fixed, i.e. the atoms in the QM region do not change
each time the calculator is called.
The embedding method relies on in vacuo energies and gradients, to which
corrections are added based on the predictions of the model. At present we
support the use of
TorchANI,
DeePMD-kit,
ORCA,
SQM,
XTB, or
PySander
for the in vacuo backend, allowing reference MM or QM calculations with EMLE embedding, as well as pure EMLE
implementations. To specify a backend, use the --backend
argument when launching
emle-server
, e.g.:
emle-server --backend torchani
(The default backend is torchani
.)
When using the orca
backend, you will also need to specify the path to the
real orca
executable using the --orca-path
command-line argument, or the
EMLE_ORCA_PATH
environment variable. (To check that EMLE is running, look for
a log or settings file in the working directory.) The input for orca
will
be taken from the &orc
section of the sander
configuration file, so use this
to specify the method, etc.
When using deepmd
as the backend you will also need to specify a model
file to use. This can be passed with the --deepmd-model
command-line argument,
or using the EMLE_DEEPMD_MODEL
environment variable. This can be a single file, or
a set of model files specified using wildcards, or as a comma-separated list.
When multiple files are specified, energies and gradients will be averaged
over the models. The model files need to be visible to the emle-server
, so we
recommend the use of absolute paths.
When using pysander
or sqm
as the backend you will also need to specify the
path to an AMBER parm7 topology file for the QM region. This can be specified
using the --parm7
command-line argument, or via the EMLE_PARM7
environment
variable.
We also provide a flexible way of supporting external backends via a callback function that can be specified via:
emle-server --external-backend module.function
The function
should take a single argument, an ase.Atoms
object for the QM region, and return the energy in Hartree as a float along
with the gradients in Hartree/Bohr as a numpy.ndarray
. The external backend
can also be supplied using the EMLE_EXTERNAL_BACKEND
environment variable.
When set, the external backend will take precedence over any other backend.
If the callback is a function within a local module, then make sure that the
directory containing the module is in sys.path
, or is visible to the emle-server
,
e.g. the server is launched from the same directory as the module. Alternatively,
use --plugin-path
to specify the path to a directory containing the module.
This can also be specified using the EMLE_PLUGIN_PATH
environment variable.
Make sure that this is an absolute path so that it is visible to the server
regardless of where it is launched.
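As a rough illustration, a callback module might look like the sketch below. The module and function names (my_backend.py, run_backend) are hypothetical, and ASE's Lennard-Jones calculator is just a stand-in for a real QM or ML backend; the unit conversions assume ASE's defaults of eV and Angstrom.
# my_backend.py: a hypothetical external backend (illustrative only).
# Launch with: emle-server --external-backend my_backend.run_backend
from ase.calculators.lj import LennardJones
from ase.units import Bohr, Hartree

def run_backend(atoms):
    """Return the in vacuo energy (Hartree) and gradients (Hartree/Bohr)
    for an ase.Atoms object representing the QM region."""
    atoms = atoms.copy()
    atoms.calc = LennardJones()  # stand-in for a real QM/ML calculator

    # ASE returns energies in eV and forces in eV/Angstrom, so convert
    # to atomic units and negate the forces to obtain gradients.
    energy = atoms.get_potential_energy() / Hartree
    gradients = -atoms.get_forces() * Bohr / Hartree

    return energy, gradients
Any function with this call signature will work, provided it returns the energy and gradients in atomic units.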
We also support the use of Rascal for the calculation of delta-learning corrections to the in vacuo energies and gradients. To use it, you will first need to create an environment with the additional dependencies:
conda env create -f environment_rascal.yaml
conda activate emle-rascal
(These are not included in the default environment as they limit the supported Python versions.)
Then, specify a model file using the --rascal-model
command-line argument, or via the EMLE_RASCAL_MODEL
environment variable.
Note that the chosen backend must match the one used to train the model. At present this is left to the user to specify. In future we aim to encode the backend in the model file so that it can be selected automatically.
We currently support CPU
and CUDA
as the device for PyTorch.
This can be configured using the EMLE_DEVICE
environment variable, or by
using the --device
argument when launching emle-server
, e.g.:
emle-server --device cuda
When no device is specified, we will preferentially try to use CUDA
if it is
available. By default, the first CUDA
device index will be used. If you want
to specify the index, e.g. when running on a multi-GPU setup, then you can use
the following syntax:
emle-server --device cuda:1
This would tell PyTorch
that we want to use device index 1
. The same formatting
works for the environment variable, e.g. EMLE_DEVICE=cuda:1
.
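For reference, the default behaviour is roughly equivalent to the following PyTorch logic (a sketch, not the exact code used internally):
import torch

# Prefer the first CUDA device when available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")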
We support electrostatic, mechanical, non-polarisable, and MM embedding.
Here non-polarisable embedding uses the EMLE model to predict charges for the
QM region, but ignores the induced component of the potential. MM embedding
allows the user to specify fixed MM charges for the QM atoms, with induction
once again disabled. Obviously we are advocating our electrostatic embedding
scheme, but the use of different embedding schemes provides a useful reference
for determining the benefit of using electrostatic embedding for a given system.
The embedding method can be specified using the EMLE_METHOD
environment
variable, or when launching the server, e.g.:
emle-server --method mechanical
The default option is (unsurprisingly) electrostatic
. When using MM
embedding you will also need to specify MM charges for the atoms within the
QM region. This can be done using the --mm-charges
option, or via the
EMLE_MM_CHARGES
environment variable. The charges can be specified as a list
of floats (space separated from the command-line, comma-separated when using
the environment variable) or a path to a file. When using a file, this should
be formatted as a single column, with one line per QM atom. The units
are electron charge.
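For example, a charges file for a hypothetical three-atom QM region (here a water molecule with TIP3P-like placeholder charges) could be generated as follows:
import numpy as np

# One MM charge per QM atom (units of electron charge), one value per line.
mm_charges = np.array([-0.834, 0.417, 0.417])
np.savetxt("mm_charges.txt", mm_charges)
The resulting file can then be passed as --mm-charges mm_charges.txt, or via EMLE_MM_CHARGES=mm_charges.txt.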
We support two methods for the calculation of atomic polarisabilities. The
default, species
, uses a single volume scaling factor for each species.
Alternatively, reference calculates the scaling factors via Gaussian
Process Regression (GPR), using the values learned for each reference environment.
The alpha mode can be specified using the --alpha-mode
command-line argument,
or via the EMLE_ALPHA_MODE
environment variable.
Energies can be written to a file using the --energy-file
command-line argument
or the EMLE_ENERGY_FILE
environment variable. The frequency of logging can be
specified using --energy-frequency
or EMLE_ENERGY_FREQUENCY
. This should be an
integer specifying the frequency, in integration steps, at which energies are
written. (The default is 0, which means that energies aren't logged.) The output
will look something like the following, where the columns specify the current
step, the in vacuo energy and the total energy.
General log messages are written to the file specified by the --log-file
or
EMLE_LOG_FILE
options. (When using the Python API, by default, no log file is
used and diagnostic messages are written to sys.stderr
. When using emle-server
,
logs are by default redirected to emle_log.txt
.) The log level can be adjusted
by using the --log-level
or EMLE_LOG_LEVEL
options. For performance, the default
log level is set to ERROR
.
# Step E_vac (Eh) E_tot (Eh)
0 -495.724193647246 -495.720214843750
1 -495.724193662147 -495.720214843750
2 -495.722049429755 -495.718475341797
3 -495.717705026011 -495.714660644531
4 -495.714381769041 -495.711761474609
5 -495.712389051656 -495.710021972656
6 -495.710483833889 -495.707977294922
7 -495.708991110067 -495.706909179688
8 -495.708890005688 -495.707183837891
9 -495.711066677908 -495.709045410156
10 -495.714580371718 -495.712799072266
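An energy file in this format can be loaded directly for plotting or analysis, e.g. with NumPy (the file name below is a placeholder; use whatever was passed to --energy-file):
import numpy as np

# Columns: step, in vacuo energy (Eh), total energy (Eh).
steps, e_vac, e_tot = np.loadtxt("energy.txt", comments="#", unpack=True)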
The xyz coordinates of the QM (ML) and MM regions can be logged by providing the
--qm-xyz-frequency
command-line argument or by setting the EMLE_QM_XYZ_FREQUENCY
environment variable (default is 0, indicating no logging). This generates a
qm.xyz
file (the name can be changed via the --qm-xyz-file
argument or the EMLE_QM_XYZ_FILE
environment variable) as an XYZ trajectory for the QM region, and a pc.xyz
file
(controlled by --pc-xyz-file
argument or the EMLE_PC_XYZ_FILE
environment
variable) with the following format:
<number of point charges in frame1>
charge_1 x y z
charge_2 x y z
...
charge_n x y z
<number of point charges in frame2>
charge_1 x y z
charge_2 x y z
...
The qm.xyz
and pc.xyz
files can be used for error analysis as described below.
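For example, the QM trajectory can be read back with ASE for custom analysis (a minimal sketch assuming the default file name):
import ase.io

# Load every frame of the logged QM (ML) region as a list of Atoms objects.
frames = ase.io.read("qm.xyz", index=":")
print(f"{len(frames)} QM frames, {len(frames[0])} atoms in the first frame")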
The EMLE implementation uses several ML frameworks to predict energies
and gradients. DeePMD-kit
or TorchANI can be used for the in vacuo
predictions and custom PyTorch code is used to predict
corrections to the in vacuo values in the presence of point charges.
The frameworks make heavy use of
just-in-time compilation.
This compilation is performed during the first EMLE call, hence
subsequent calculations are much faster. By using a long-lived server
process to handle EMLE calls from sander
we can get the performance
gain of JIT compilation.
A demo showing how to run EMLE on a solvated alanine dipeptide system can be found in the demo directory. To run:
cd demo
./demo.sh
Output will be written to the demo/output
directory.
It is possible to use emle-engine
to perform end-state correction (ESC)
for alchemical free-energy calculations. Here a λ value is used to
interpolate between the full MM (λ = 0) and EMLE (λ = 1) modified
potential. To use this feature specify the λ value from the command-line,
e.g.:
emle-server --lambda-interpolate 0.5
or via the ESC_LAMBDA_INTERPOLATE
environment variable. When performing
interpolation it is also necessary to specify the path to a topology file
for the QM region. This can be specified using the --parm7
command-line
argument, or via the EMLE_PARM7
environment variable. You will also need to
specify the (zero-based) indices of the atoms within the QM region. To do so,
use the --qm-indices
command-line argument, or the EMLE_QM_INDICES
environment
variable. Finally, you will need to specify MM charges for the QM atoms using
the --mm-charges
command-line argument or the EMLE_MM_CHARGES
environment
variable. These are used to calculate the electrostatic interactions between
point charges on the QM and MM regions.
It is possible to pass one or two values for λ. If a single value is used, then
the calculator will always use that value for interpolation, unless it is updated
externally using the --set-lambda-interpolate
command line option, e.g.:
emle-server --set-lambda-interpolate 1
Alternatively, if two values are passed then these will be used as initial and
final values of λ, with the additional --interpolate-steps
option specifying
the number of steps (calls to the server) over which λ will be linearly
interpolated. (This can also be specified using the EMLE_INTERPOLATE_STEPS
environment variable.) In this case the energy file (if written) will contain
output similar to that shown below. The columns specify the current step, the
current λ value, the energy at the current λ value, and the pure MM and EMLE
energies.
# Step λ E(λ) (Eh) E(λ=0) (Eh) E(λ=1) (Eh)
0 0.000000000000 -0.031915396452 -0.031915396452 -495.735900878906
5 0.100000000000 -49.588279724121 -0.017992891371 -495.720855712891
10 0.200000000000 -99.163040161133 -0.023267691955 -495.722106933594
15 0.300000000000 -148.726318359375 -0.015972195193 -495.717071533203
20 0.400000000000 -198.299896240234 -0.020024012774 -495.719726562500
25 0.500000000000 -247.870407104492 -0.019878614694 -495.720947265625
30 0.600000000000 -297.434417724609 -0.013046705164 -495.715332031250
35 0.700000000000 -347.003417968750 -0.008571878076 -495.715515136719
40 0.800000000000 -396.570098876953 -0.006970465649 -495.710876464844
45 0.900000000000 -446.150207519531 -0.019694851711 -495.720275878906
50 1.000000000000 -495.725952148438 -0.020683377981 -495.725952148438
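As the numbers above show, the reported E(λ) is a linear combination of the pure MM (λ = 0) and EMLE (λ = 1) energies. A minimal sketch of the mixing:
# E(lambda) = (1 - lambda) * E_MM + lambda * E_EMLE
def interpolated_energy(lam, e_mm, e_emle):
    return (1.0 - lam) * e_mm + lam * e_emle

# Reproduces the lambda = 0.1 row of the table above (approximately -49.588).
print(interpolated_energy(0.1, -0.017992891371, -495.720855712891))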
We provide an interface between emle-engine
and OpenMM via the
Sire molecular simulation framework. This allows QM/MM simulations
to be run with OpenMM using EMLE for the embedding model. This provides greatly
improved performance and flexibility in comparison to the sander
interface.
To use, first create an emle-sire
conda environment:
conda env create -f environment_sire.yaml
conda activate emle-sire
Next install emle-engine
into the environment:
pip install .
For instructions on how to use the emle-sire
interface, see the tutorial
documentation here.
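On the Python side, the calculator is created directly and then passed to Sire when setting up the QM/MM simulation, as described in the tutorial. A minimal sketch; the device keyword mirrors the --device option described earlier and is shown for illustration rather than as the full constructor signature:
from emle.calculator import EMLECalculator

# Create an EMLE calculator to be used by the Sire/OpenMM interface.
calculator = EMLECalculator(device="cpu")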
When performing end-state correction simulations using the emle-sire
interface
there is no need to specify the lambda_interpolate
keyword when creating an
EMLECalculator
instance. Instead, interpolation can be enabled when creating a
Sire
dynamics object via the same keyword. (See the tutorial for details.)
The emle.models
module provides a number of torch
models. The base EMLE
model
can be used to compute the EMLE energy in isolation. The combined ANI2xEMLE
and MACEEMLE
models allow the computation of in vacuo and embedding energies
in one go, using the ANI2x and MACE models respectively. Creating additional models is straightforward. For details of how to use the torch
models,
see the tutorial documentation here.
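As a minimal sketch, the base EMLE model can be evaluated directly on torch tensors. The argument order and units shown here (atomic numbers, MM charges, QM and MM coordinates in Angstrom) are assumptions about the interface; consult the tutorial documentation linked above for the definitive API.
import torch
from emle.models import EMLE

# Instantiate the base model, which computes the embedding energy only.
model = EMLE()

# Hypothetical QM water molecule plus two MM point charges (coordinates in Angstrom).
atomic_numbers = torch.tensor([8, 1, 1])
charges_mm = torch.tensor([-0.834, 0.417])
xyz_qm = torch.tensor([[0.00, 0.00, 0.00],
                       [0.96, 0.00, 0.00],
                       [-0.24, 0.93, 0.00]])
xyz_mm = torch.tensor([[3.00, 0.00, 0.00],
                       [3.50, 0.50, 0.00]])

energy = model(atomic_numbers, charges_mm, xyz_qm, xyz_mm)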
emle-engine
provides a CLI tool emle-analyze
that facilitates analysis of
the performance of EMLE-based simulations. It requires a set of single point
reference calculations for a trajectory generated with emle-server
(currently
only ORCA is supported). It also requires MBIS
decomposition of the in vacuo electronic density of the QM region with
horton. Usage:
emle-analyze --qm-xyz qm.xyz \
    --pc-xyz pc.xyz \
    --emle-model model.mat \
    --orca-tarball orca.tar \
    --backend [deepmd, ani2x] \
    --alpha \
    result.mat
qm.xyz
and pc.xyz
are the QM and MM XYZ trajectories written out by
emle-server
(see above in the "Logging" section).
model.mat
is the EMLE model
used.
orca.tar
is a tarball containing single point ORCA calculations and
corresponding horton outputs. All files should be named as index.*
where index
is a numeric value identifying the snapshot (does not have to be consecutive)
and the extensions are:
- .vac.orca: ORCA output for the gas phase calculation. When the --alpha argument is provided, this must also include the molecular dipolar polarizability (%elprop Polar)
- .h5: horton output for the gas phase calculation
- .pc.orca: ORCA output for the calculation with point charges
- .pc: charges and positions of the point charges (the ones used for the .pc.orca calculation)
- .vpot: output of orca_vpot, the electrostatic potential of the gas phase system at the positions of the point charges
The optional --backend argument also allows the energies from the in vacuo
backend to be extracted. Currently, only the deepmd and ani2x backends are
supported by emle-analyze. When the deepmd backend is used, the DeepMD model
must be provided with --deepmd-model.
Training of custom EMLE models can be performed with the emle-train tool. It
requires a tarball of reference QM calculations following the same naming
convention as for the emle-analyze script, with the difference that only
gas phase calculations are required and dipolar polarizabilities must be present.
Simple usage:
emle-train --orca-tarball orca.tar model.mat
The resulting model.mat
file can be directly used as --emle-model
argument
for emle-server
. A full list of arguments and their default values can be
printed with emle-train -h
:
usage: emle-train [-h] --orca-tarball name.tar [--train-mask] [--sigma] [--ivm-thr] [--epochs]
[--lr-qeq] [--lr-thole] [--lr-sqrtk] [--print-every] [--computer-n-species]
[--computer-zid-map] [--plot-data name.mat]
output
EMLE training script
positional arguments:
output Output model file
options:
-h, --help show this help message and exit
--orca-tarball name.tar
ORCA tarball (default: None)
--train-mask Mask for training set (default: None)
--sigma Sigma value for GPR (default: 0.001)
--ivm-thr IVM threshold (default: 0.05)
--epochs Number of training epochs (default: 100)
--lr-qeq Learning rate for QEq params (a_QEq, chi_ref) (default: 0.05)
--lr-thole Learning rate for Thole model params (a_Thole, k_Z) (default: 0.05)
--lr-sqrtk Learning rate for polarizability scaling factors (sqrtk_ref) (default: 0.05)
--print-every How often to print training progress (default: 10)
--computer-n-species
Number of species supported by AEV computer (default: None)
--computer-zid-map Map between EMLE and AEV computer zid values (default: None)
--plot-data name.mat Data for plotting (default: None)
train-mask is a file containing 0's and 1's that defines the subset of the full
training set (provided as --orca-tarball) that is used for training. Note that
the values written to --plot-data are for the full training set, which allows
prediction plots to be made for both the training and test sets.
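For example, a mask selecting the first 80 snapshots of a hypothetical 100-frame training set could be written as follows (one 0 or 1 per line, which is an assumption about the expected layout):
import numpy as np

# 1 = include the snapshot in training, 0 = hold it out for testing.
mask = np.zeros(100, dtype=int)
mask[:80] = 1
np.savetxt("train_mask.txt", mask, fmt="%d")
The file is then passed with --train-mask train_mask.txt.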
The --computer-n-species
and --computer-zid-map
arguments are only needed when
using a common AEV computer for both the gas phase backend and the EMLE model.
The DeePMD-kit conda package pulls in a version of MPI which may cause
problems if using ORCA as the in vacuo backend, particularly when running
on HPC resources that might enforce a specific MPI setup. (ORCA will
internally call mpirun
to parallelise work.) Since we don't need any of
the MPI functionality from DeePMD-kit
, the problematic packages can be
safely removed from the environment with:
conda remove --force mpi mpich
Alternatively, if performance isn't an issue, simply set the number of
threads to 1 in the sander
input file, e.g.:
&orc
method='XTB2',
num_threads=1
/
When running on an HPC resource it can often take a while for the emle-server
to start. As such, the client will try reconnecting on failure a specified
number of times before raising an exception, sleeping 2 seconds between
retries. By default, the client will try to connect 100 times. If this
is unsuitable for your setup, then the number of attempts can be configured
using the EMLE_RETRIES
environment variable.
When performing interpolation it is currently not possible to use AMBER force
fields with CMAP terms due to a memory deallocation bug in pysander
.