Initial commit

jthorton · Oct 7, 2021 · 7bd59b7 · 7bd59b7
1 parent cc2c2d2
commit 7bd59b7
Show file tree

Hide file tree

Showing 15 changed files with 446 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,62 @@
+Double exponential vdW functional form 
+======================================
+
+This repository contains the scripts, inputs and the results generated as part of the *...* publication.
+
+**Warning:** This repository, its structure and contents, are currently in a state of flux and incompleteness while the 
+study is ongoing. We *do not* guarantee the scientific correctness of anything found within, nor do we yet recommend 
+using any force field parameters found here.
+
+### Structure
+
+This repository is structured into four main directories:
+
+* `data-set-curation` - contains the script used to curate the training and test data sets.
+
+* `inputs-and-results` - contains the *most up to date* input files required to reproduce this study. **See the Reproduction** section of this README for more information. The project structure was for the most part generated automatically using the [`nonbonded`](https://github.com/SimonBoothroyd/nonbonded) package.
+
+* `schema` - contains schemas which define most parts of the project, including definitions of
+  which optimizations and benchmarks were performed.
+
+* `scripts` - contains the script used to generate the input schemas / files, and scripts which perform ancillary data 
+  analysis.
+
+### Experimental Data Sets
+
+The experimental data sets used in this project were curated from the [NIST ThermoML](https://trc.nist.gov/ThermoML.html)
+archive. The citations for the individual measurements can be found in `DATA_CITATIONS.bib` 
+
+### Reproduction
+
+The exact inputs used and outputs reported (including the conda environment used to generate them) in the publication 
+have been included as tagged releases to this repository. 
+
+For those looking to reproduce the study, the required dependencies may be obtained directly using conda:
+
+```bash
+conda env create --name double-exp-vdw --file environment.yaml
+```
+
+#### Optimizations
+
+In most cases the optimizations can be re-run using the following commands
+
+```bash
+cd inputs-and-results/optimizations/XXX/
+nonbonded optimization run
+nonbonded optimization analyze
+```
+
+while the QM optimizations may be re-run, e.g., according to
+
+```bash
+cd inputs-and-results/optimizations/vdw-v1-td-opt-vib-v1/
+ForceBalance optimize.in
+``` 
+
+A more complete set of instructions for performing QM fits with ForceBalance can be found [here](https://github.com/openforcefield/openforcefield-forcebalance)
+
+#### Benchmarks
+
+A comprehensive set of instructions for re-running the benchmarks can be found in the `inputs-and-results/benchmarks` 
+directory.
diff --git a/data-set-curation/README.md b/data-set-curation/README.md
@@ -0,0 +1,11 @@
+# Physical Property Data Sets
+
+Within these directories are the scripts used to curate the experimental physical property
+data sets that the vdW force field parameters were trained and tested against.
+
+In particular:
+
+* ...
+
+All data sets were curated using the utilities provided by the `openff-evaluator` package and stored
+for easy access in both `nonbonded` data set objects and pandas csv files.
diff --git a/data-set-curation/physical-property/README.md b/data-set-curation/physical-property/README.md
@@ -0,0 +1,10 @@
+# Physical Property Data Sets
+
+Within these directories are the scripts used to curate the experimental physical property
+data sets that the vdW force field parameters were trained (`optimizations/`) and tested (`benchmarks/`) 
+against.
+
+All data sets were curated using the utilities provided by the `openff-evaluator` package and stored
+for easy access in both `nonbonded` data set objects and pandas csv files.
+
+Instructions for curating the sets can be found in each of the sub-directories.
diff --git a/data-set-curation/physical-property/benchmarks/README.md b/data-set-curation/physical-property/benchmarks/README.md
@@ -0,0 +1,24 @@
+# Physical Property Benchmark Data Sets
+
+Within this directories are the scripts used to curate the experimental solvation / transfer free energy physical 
+property data sets that the vdW force field parameters were tested against.
+
+The data is sourced from both the FreeSolv and the Minnesota Solvation (MNSol) Databases. Due to how the MNSol is
+licensed we do not include it here, however it can easily be obtained from the [main website](https://conservancy.umn.edu/handle/11299/213300).
+
+## Curating the data sets
+
+After obtaining a copy of the MNSol set:
+
+1) Run the `mnsol-to-csv.py` script to convert the MNSol data from an XLS file to a more easily 
+   manipulable CSV file.
+
+2) Select a diverse and representative subset of the full MNSol by running the `curate-mnsol-test-set.py`.
+   This will create both a `sage-mnsol-test-v1.json` and a `sage-mnsol-test-v1.csv` file in the `data-sets`
+   directory which contain the selected test set data in two convenient formats. For completeness, we have 
+   included a `sage-mnsol-test-v1.txt` file which contains exactly the substances used in our test set but 
+   which excludes the actual values due to licensing concerns.
+
+3) Select a subset of FreeSolv whereby each selected solute also appears as a solute in the selected MNSol
+   set. This allows non-aques to aqueous transfer free energies to be computed. This will create both a 
+   `sage-fsolv-test-v1.json` and a `sage-fsolv-test-v1.csv` file in the `data-sets` directory.
diff --git a/data-set-curation/physical-property/optimizations/README.md b/data-set-curation/physical-property/optimizations/README.md
@@ -0,0 +1,7 @@
+# Physical Property Optimization Data Sets
+
+Within this directories are the scripts used to curate the experimental density and enthalpy of mixing physical 
+property data sets that the vdW force field parameters were trained against.
+
+The data is sourced from the [ThermoML archive](https://www.nist.gov/mml/acmd/trc/thermoml) using the 
+`openff-evaluator` packages data curation tools.
diff --git a/data-set-curation/quantum-chemical/README.md b/data-set-curation/quantum-chemical/README.md
@@ -0,0 +1,9 @@
+# Quantum Chemical Data Sets
+
+This directory contains the scripts used to curate the different quantum chemical *training* sets from
+data stored in the [QCArchive](https://qcarchive.molssi.org/), as well as the final curated data sets.
+
+**Note**: As the QCArchive is a constantly evolving and updating collection it is possible that running the
+          scripts provided here will yield slightly different curated collections depending on if new results
+          have become available. For full reproducibility we have in the `data-sets` directory provided JSON
+          records which encode the **exact** ids of QC record used during training.
diff --git a/data-set-curation/quantum-chemical/data-sets/.gitkeep b/data-set-curation/quantum-chemical/data-sets/.gitkeep
diff --git a/environment.yaml b/environment.yaml
@@ -0,0 +1,25 @@
+name: double-exp-vdw
+
+channels:
+  - simonboothroyd
+  - openeye
+  - conda-forge
+  - defaults
+
+dependencies:
+
+  - python
+  - pip
+
+  - openeye-toolkits
+
+  # Common dependencies
+  - tqdm
+  - seaborn
+
+  # vdW fitting
+  - openff-evaluator >=0.3.6
+  - nonbonded >= 0.0.1b1
+  - distributed ==2.30.1
+  - mdtraj ==1.9.4
+  - smirnoff-plugins >=0.0.2
diff --git a/inputs-and-results/README.md b/inputs-and-results/README.md
@@ -0,0 +1,7 @@
+# Inputs and Results
+
+This directory contains the main input files required to reproduce the performed optimizations and benchmarks
+which were carried out while training the Sage line of OpenFF force fields.
+
+The individual `optimizations` and `benchmarks` sub-directories contain detailed instructions on how the
+input scripts and files should be used.
diff --git a/inputs-and-results/benchmarks/README.md b/inputs-and-results/benchmarks/README.md
@@ -0,0 +1,9 @@
+# Benchmarks
+
+Contained within this directory are the main inputs (or instructions on how they may be generated)
+for performing benchmarks of the produced force fields.
+
+## Manifest
+
+* `transfer-free-energy` - a directory containing the inputs required to perform solvation / transfer free energy
+  calculations of a diverse set of solutes / solvents using the `openff-evaluator` package.
diff --git a/inputs-and-results/benchmarks/transfer-free-energies/README.md b/inputs-and-results/benchmarks/transfer-free-energies/README.md
diff --git a/inputs-and-results/optimizations/README.md b/inputs-and-results/optimizations/README.md
@@ -0,0 +1,29 @@
+# Optimizations
+
+Contained within this directory are the main inputs (or instructions on how they may be generated)
+for training the matrix of force fields against both physical property and QC data.
+
+Currently, the inputs for the vdW fits against physical property data sets are provided, while the
+inputs for the valence fits will need to be generated manually due to their size.
+
+## Running the vdW fits
+
+Each of the physical property fits can be re-run using the
+
+```shell
+nonbonded optimization run --restart true
+```
+
+command. The ``--restart true`` flag will ensure that the optimization is correctly restarted if you
+are trying to continue from a previous attempt which failed for whatever reason.
+
+The outputs of the fit can be analyzed and plotted using:
+
+```shell
+nonbonded optimization analyze
+nonbonded optimization plot
+```
+
+which will produce a set of analyzed outputs in a new `analysis` directory, and a set of corresponding
+plots, including a trace of the objective function and RMSE metrics from before and after the fit, will
+be generated in a new `plots` directory.
diff --git a/scripts/build_initial_ff.py b/scripts/build_initial_ff.py
@@ -0,0 +1,127 @@
+"""Convert openff-2.0.0 to a double exponential style force field with a custom tip4p-de
+water model that has been pre-fit."""
+
+from openff.toolkit.typing.engines.smirnoff import ForceField, ParameterList
+from simtk import unit
+from smirnoff_plugins.handlers.nonbonded import DoubleExponential
+
+
+def add_tip4p_double_exp(force_field: ForceField):
+
+    double_exp_handler = force_field.get_parameter_handler("DoubleExponential")
+
+    # now add in pre-fit tip4p-de water model parameters
+    de_o = double_exp_handler.DEType(
+        "[#1]-[#8X2H2+0:1]-[#1]",
+        epsilon=0.21104 * unit.kilocalorie_per_mole,
+        r_min=3.5204 * unit.angstrom,
+        id="tip4p-de-O",
+    )
+    de_h = double_exp_handler.DEType(
+        "[#1:1]-[#8X2H2+0]-[#1]",
+        epsilon=0 * unit.kilocalorie_per_mole,
+        r_min=1 * unit.angstrom,
+        id="tip4p-de-H",
+    )
+
+    double_exp_handler._parameters.append(de_o)
+    double_exp_handler._parameters.append(de_h)
+
+    # add a library charge to zero the water and then use the virtual site to adjust the
+    # charges
+    lb_charges = force_field.get_parameter_handler("LibraryCharges")
+    lb_charges._parameters = ParameterList(
+        [
+            lb_charge
+            for lb_charge in force_field.get_parameter_handler("LibraryCharges")
+            if lb_charge.id is None or "tip3p" not in lb_charge.id
+        ]
+    )
+    lb_charges.add_parameter(
+        {
+            "smirks": "[#1:1]-[#8X2H2+0:2]-[#1:3]",
+            "id": "tip4p-de",
+            "charge1": 0 * unit.elementary_charge,
+            "charge2": 0 * unit.elementary_charge,
+            "charge3": 0 * unit.elementary_charge,
+        }
+    )
+
+    virtual_site_handler = force_field.get_parameter_handler("VirtualSites")
+    virtual_site_handler.add_parameter(
+        {
+            "smirks": "[#1:1]-[#8X2H2+0:2]-[#1:3]",
+            "type": "DivalentLonePair",
+            "distance": -0.010743 * unit.nanometers,
+            "outOfPlaneAngle": 0.0 * unit.degrees,
+            "match": "once",
+            "charge_increment1": 0.53254 * unit.elementary_charge,
+            "charge_increment2": 0.0 * unit.elementary_charge,
+            "charge_increment3": 0.53254 * unit.elementary_charge,
+        }
+    )
+    # Currently required due to OpenFF issue #884
+    virtual_site_handler._parameters = ParameterList(virtual_site_handler._parameters)
+
+    # Setup the needed constraints
+    constraints = force_field.get_parameter_handler("Constraints")
+
+    for parameter in constraints.parameters:
+        parameter.id = parameter.id.replace("tip3p", "tip4p")
+
+
+def build_double_exp_force_field(
+    initial_force_field_path: str = "openff-2.0.0.offxml", water_model: str = "tip4p-de"
+) -> ForceField:
+
+    water_models = ["tip3p", "tip4p-de"]
+    assert water_model in water_models, f"the allowed water models are {water_models}"
+
+    ff = ForceField(initial_force_field_path, load_plugins=True)
+
+    # create the new double exponential handler and configure with the alpha and beta
+    # from the water fits
+    double_exp_handler = DoubleExponential(version="0.3")
+    double_exp_handler.scale14 = 0.5
+    double_exp_handler.alpha = 16.789
+    double_exp_handler.beta = 4.592
+
+    # loop over the vdw and transfer the parameters and zero out the old LJ 12-6 terms
+    vdw = ff.get_parameter_handler("vdW")
+
+    for lj_parameter in vdw.parameters:
+
+        if "tip3p" in lj_parameter.id:  # Skip TIP3P so we can use the custom TIP4P-DE
+            continue
+
+        de_parameter = double_exp_handler.DEType(
+            lj_parameter.smirks,
+            r_min=(lj_parameter.sigma ** 6 * 2) ** (1 / 6),
+            epsilon=lj_parameter.epsilon,
+        )
+        double_exp_handler._parameters.append(de_parameter)
+
+    ff.deregister_parameter_handler("vdW")
+
+    # we have to modify this parameter manually as this can cause the long range
+    # correction in openmm to fail. This parameter is a place holder with almost zero
+    # values, with epsilon mostly transferred to the oxygen here we make the r_min
+    # bigger to work but we could probably zero out this term completely?
+    problem_parameter = double_exp_handler.parameters["[#1:1]-[#8]"]
+    problem_parameter.r_min = 1.0 * unit.angstrom
+
+    ff.register_parameter_handler(double_exp_handler)
+
+    if water_model == "tip4p-de":
+        add_tip4p_double_exp(ff)
+    elif water_model == "tip3p":
+        pass
+    else:
+        raise NotImplementedError()
+
+    return ff
+
+
+if __name__ == "__main__":
+    ff = build_double_exp_force_field()
+    ff.to_file("double-exp-ff.offxml")