
Commit 58e4e48

Improved kernel description and normalizing flows (#228)
1 parent 34c7dc0 commit 58e4e48

3 files changed: +39 −6 lines

dl/flows.ipynb

Lines changed: 4 additions & 2 deletions
@@ -44,7 +44,9 @@
 "\n",
 "A bijector is a function that is [injective](https://en.wikipedia.org/wiki/Injective_function) (1 to 1) and [surjective](https://en.wikipedia.org/wiki/Surjective_function) (onto). An equivalent way to view a bijective function is that it has an inverse. For example, a sum reduction has no inverse and is thus not bijective: $\\sum [1,0] = 1$ and $\\sum [-1, 2] = 1$. Multiplying by a matrix which has an inverse is bijective. $y = x^2$ is not bijective, since $y = 4$ has two solutions. \n",
 "\n",
-"Remember that we must compute the determinant of the bijector Jacobian. If the Jacobian is dense (all output elements depend on all input elements), computing this quantity will be $O\\left(|x|_0^3\\right)$ where $|x|_0$ is the number of dimensions of $x$ because a determinant scales by $O(n^3)$. This would make computing normalizing flows impractical in high-dimensions. However, in practice we restrict ourselves to bijectors that have easy to calculate Jacobians. For example, if the bijector is $x_i = \\cos z_i$ then the Jacobian will be diagonal. Typically, the trick that is done is to make the Jacobian triangular. Then $x_0$ only depends on $z_0$, $z_1$ depends on $z_0, Z_1$, and $x_2$ depends on $z_0, z_1, z_2$, etc. The matrix determinant is then computed in linear time with respect to the number of dimensions.\n",
+"Remember that we must compute the determinant of the bijector Jacobian. If the Jacobian is dense (all output elements depend on all input elements), computing this quantity will be $O\\left(|x|_0^3\\right)$, where $|x|_0$ is the number of dimensions of $x$, because a determinant scales as $O(n^3)$. This would make computing normalizing flows impractical in high dimensions. However, in practice we restrict ourselves to bijectors with easy-to-calculate Jacobians. For example, if the bijector is $x_i = \\cos z_i$ then the Jacobian will be diagonal. Such a diagonal Jacobian means that each dimension is independent of the others, though.\n",
+"\n",
+"One way to get faster determinants without making each dimension independent is to use a triangular Jacobian. Then $x_0$ only depends on $z_0$, $x_1$ depends on $z_0, z_1$, and $x_2$ depends on $z_0, z_1, z_2$, etc. This enables fitting high-dimensional correlations for some of the dimensions (like $x_n$). The determinant of a triangular matrix is computed in linear time with respect to the number of dimensions, because it is just the product of the matrix diagonal.\n",
 "\n",
 "### Bijector Chains\n",
 "\n",
@@ -518,7 +520,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.5"
+"version": "3.8.12"
 }
 },
 "nbformat": 4,

ml/kernel.ipynb

Lines changed: 15 additions & 4 deletions
@@ -117,9 +117,18 @@
 "feature_names = soldata.columns[features_start_at:]\n",
 "np.random.seed(0)\n",
 "\n",
-"# standardize the features\n",
-"soldata[feature_names] -= soldata[feature_names].mean()\n",
-"soldata[feature_names] /= soldata[feature_names].std()"
+"# Split into train (80%) and test (20%)\n",
+"N = len(soldata)\n",
+"split = int(N * 0.8)\n",
+"shuffled = soldata.sample(N, replace=False)\n",
+"train = shuffled[:split]\n",
+"test = shuffled[split:]\n",
+"\n",
+"# standardize the features using only train statistics\n",
+"test[feature_names] -= train[feature_names].mean()\n",
+"test[feature_names] /= train[feature_names].std()\n",
+"train[feature_names] -= train[feature_names].mean()\n",
+"train[feature_names] /= train[feature_names].std()"
 ]
 },
 {
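The same leakage-free pattern, fitting the standardization statistics on the training split only, can also be written with scikit-learn (a sketch under the assumption that scikit-learn is available; the notebook itself uses plain pandas):

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, then apply it to both
# splits, so no test-set information leaks into the features.
scaler = StandardScaler().fit(train[feature_names])
train[feature_names] = scaler.transform(train[feature_names])
test[feature_names] = scaler.transform(test[feature_names])
```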
@@ -128,7 +137,9 @@
 "source": [
 "## Kernel Definition\n",
 "\n",
-"We'll start by creating our kernel function. *Our kernel function does not need to be differentiable*. In contrast to the functions we see in deep learning, we can use sophisticated and non-differentiable functions in kernel learning. For example, you could use a two-component molecular dynamics simulation to compute the kernel between two molecules. We'll still implement our kernel functions in JAX for this example because it is efficient and consistent. Remember our kernel should take two feature vectors and return a scalar. In our example, we will simply use the $l_2$ norm. I will add one small twist though: dividing by the dimension. This makes our kernel output magnitude is independent of the number of dimensions of $x$. "
+"We'll start by creating our kernel function. *Our kernel function does not need to be differentiable*. In contrast to the functions we see in deep learning, we can use sophisticated and non-differentiable functions in kernel learning. For example, you could use a two-component molecular dynamics simulation to compute the kernel between two molecules. We'll still implement our kernel functions in JAX for this example because it is efficient and consistent. Remember, our kernel should take two feature vectors and return a scalar. In our example, we will simply use the $l_2$ norm, with one small change: dividing by the dimension. This makes our kernel's output magnitude independent of the number of dimensions of $x$. Other options for kernels on dense vectors are $1 -$ cosine similarity (dot product) or the [Mahalanobis distance](https://en.wikipedia.org/wiki/Mahalanobis_distance).\n",
+"\n",
+"Choosing a kernel function is an open question for molecules. You can work with the molecular structure directly with a variety of ideas. You can use {doc}`../dl/gnn` to convert molecules into vectors and use any of the vector kernels above. You can also work with fingerprints (vectors of bits) from cheminformatics libraries like RDKit and compare such vectors with [Tanimoto similarity](https://en.wikipedia.org/wiki/Jaccard_index) {cite}`rankovic_griffiths_moss_schwaller_2022, yang2022classifying`. "
 ]
 },
 {
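A minimal sketch of the kernels named above (the function names are illustrative and the notebook's actual cell may differ):

```python
import jax.numpy as jnp

def l2_kernel(x1, x2):
    # l2 distance divided by the dimension, so the output magnitude
    # is independent of the feature count
    return jnp.linalg.norm(x1 - x2) / x1.shape[0]

def cosine_kernel(x1, x2):
    # 1 - cosine similarity, another option for dense vectors
    return 1 - jnp.dot(x1, x2) / (jnp.linalg.norm(x1) * jnp.linalg.norm(x2))

def tanimoto(a, b):
    # Tanimoto (Jaccard) similarity for binary fingerprint vectors
    both = jnp.sum(a * b)
    return both / (jnp.sum(a) + jnp.sum(b) - both)
```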

references.bib

Lines changed: 20 additions & 0 deletions
@@ -1796,3 +1796,23 @@ @article{blumenthal2022ai
 year = {2022},
 publisher = {Taylor \& Francis}
 }
+
+@article{rankovic_griffiths_moss_schwaller_2022,
+  place = {Cambridge},
+  title = {Bayesian optimisation for additive screening and yield improvements in chemical reactions -- beyond one-hot encodings},
+  doi = {10.26434/chemrxiv-2022-nll2j},
+  journal = {ChemRxiv},
+  publisher = {Cambridge Open Engage},
+  author = {Ranković, Bojana and Griffiths, Ryan-Rhys and Moss, Henry B. and Schwaller, Philippe},
+  year = {2022}
+}
+
+@article{yang2022classifying,
+  title = {Classifying the toxicity of pesticides to honey bees via support vector machines with random walk graph kernels},
+  author = {Yang, Ping and Henle, E. Adrian and Fern, Xiaoli Z. and Simon, Cory M.},
+  journal = {The Journal of Chemical Physics},
+  year = {2022},
+  publisher = {AIP Publishing LLC}
+}
