Commit f60e0bf ("update")
1 parent c8755f5
8 files changed: +280 additions, -280 deletions

doc/pub/week13/html/week13-bs.html

Lines changed: 23 additions & 23 deletions
@@ -705,10 +705,10 @@ <h2 id="analysis" class="anchor">Analysis </h2>
 <h2 id="the-vae" class="anchor">The VAE </h2>

 <p>In the default formulation of the VAE by Kingma and Welling (2015), we directly maximize the ELBO. This
-approach is \textit{variational}, because we optimize for the best
+approach is <b>variational</b>, because we optimize for the best
 \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \) amongst a family of potential posterior
 distributions parameterized by \( \boldsymbol{\phi} \). It is called an
-\textit{autoencoder} because it is reminiscent of a traditional
+<b>autoencoder</b> because it is reminiscent of a traditional
 autoencoder model, where input data is trained to predict itself after
 undergoing an intermediate bottlenecking representation step.
 </p>
@@ -734,11 +734,11 @@ <h2 id="bottlenecking-distribution" class="anchor">Bottlenecking distribution </

 <p>In this case, we learn an intermediate bottlenecking distribution
 \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \) that can be treated as
-an \textit{encoder}; it transforms inputs into a distribution over
+an <b>encoder</b>; it transforms inputs into a distribution over
 possible latents. Simultaneously, we learn a deterministic function
 \( p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h}) \) to convert a given latent vector
 \( \boldsymbol{h} \) into an observation \( \boldsymbol{x} \), which can be interpreted as
-a \textit{decoder}.
+a <b>decoder</b>.
 </p>

 <!-- !split -->
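The encoder/decoder pair described in this hunk maps directly onto two small networks: one that outputs the parameters of \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \), and one deterministic map playing the role of \( p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h}) \). A minimal sketch, assuming PyTorch and a diagonal covariance; the class names, layer sizes, and dimensions are illustrative, not the lecture's actual implementation:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mu, log sigma^2) of q_phi(h|x)."""
    def __init__(self, x_dim, h_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, h_dim)       # mean of the latent Gaussian
        self.logvar = nn.Linear(hidden, h_dim)   # log-variance (diagonal Sigma)

    def forward(self, x):
        s = self.net(x)
        return self.mu(s), self.logvar(s)

class Decoder(nn.Module):
    """Deterministic map for p_theta(x|h): latent vector h -> reconstruction of x."""
    def __init__(self, h_dim, x_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(h_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, h):
        return self.net(h)
```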
@@ -785,7 +785,7 @@ <h2 id="reparameterization-trick" class="anchor">Reparameterization trick </h2>
 <p>However, a problem arises in this default setup: each \( \boldsymbol{h}^{(l)} \)
 that our loss is computed on is generated by a stochastic sampling
 procedure, which is generally non-differentiable. Fortunately, this
-can be addressed via the \textit{reparameterization trick} when
+can be addressed via the <b>reparameterization trick</b> when
 \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \) is designed to model certain
 distributions, including the multivariate Gaussian.
 </p>
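The trick itself is one line of arithmetic: draw \( \epsilon \sim \mathcal{N}(0, I) \) and set \( \boldsymbol{h} = \mu + \Sigma^{1/2}\epsilon \), so the sample becomes a deterministic, differentiable function of \( \mu \) and \( \Sigma \), with all randomness isolated in \( \epsilon \). A hedged sketch, again assuming PyTorch and a diagonal covariance stored as its log-variance:

```python
import torch

def reparameterize(mu, logvar):
    """Sample h ~ N(mu, diag(exp(logvar))) differentiably: h = mu + sigma * eps."""
    std = torch.exp(0.5 * logvar)  # sigma from the log-variance
    eps = torch.randn_like(std)    # eps ~ N(0, I); the only source of randomness
    return mu + std * eps          # gradients flow through mu and logvar
```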
@@ -856,9 +856,9 @@ <h2 id="after-training" class="anchor">After training </h2>
 <h2 id="setting-up-sgd" class="anchor">Setting up SGD </h2>
 <p>So how can we perform stochastic gradient descent?</p>

-<p>First we need to be a bit more specific about the form that \( Q(\boldsymbol{h}|\boldsymbol{x}) \)
+<p>First we need to be a bit more specific about the form that \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \)
 will take. The usual choice is to say that
-\( Q(\boldsymbol{h}|\boldsymbol{x})=\mathcal{N}(\boldsymbol{h}|\mu(\boldsymbol{x};\vartheta),\Sigma(;\vartheta)) \), where
+\( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})=\mathcal{N}(\boldsymbol{h}|\mu(\boldsymbol{x};\vartheta),\Sigma(\boldsymbol{x};\vartheta)) \), where
 \( \mu \) and \( \Sigma \) are arbitrary deterministic functions with
 parameters \( \vartheta \) that can be learned from data (we will omit
 \( \vartheta \) in later equations). In practice, \( \mu \) and \( \Sigma \) are
@@ -873,7 +873,7 @@ <h2 id="more-on-the-sgd" class="anchor">More on the SGD </h2>
 the fact that \( \mu \) and \( \Sigma \) are &quot;encoding&quot; \( \boldsymbol{x} \) into the latent
 space \( \boldsymbol{h} \). The advantages of this choice are computational, as they
 make it clear how to compute the right hand side. The last
-term---\( \mathcal{D}\left[Q(\boldsymbol{h}|\boldsymbol{x})\|p(\boldsymbol{h})\right] \)---is now a KL-divergence
+term---\( \mathcal{D}\left[q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\|p(\boldsymbol{h})\right] \)---is now a KL-divergence
 between two multivariate Gaussian distributions, which can be computed
 in closed form as:
 </p>
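For the common special case of a diagonal-Gaussian encoder and a standard-normal prior \( p(\boldsymbol{h})=\mathcal{N}(0,I) \), this closed form reduces to a sum over latent dimensions; the general two-Gaussian formula in the source file sits outside this diff's context. A sketch of that special case:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """D[ N(mu, diag(exp(logvar))) || N(0, I) ]
    = 0.5 * sum( exp(logvar) + mu^2 - 1 - logvar ) over latent dimensions."""
    return 0.5 * torch.sum(torch.exp(logvar) + mu.pow(2) - 1.0 - logvar, dim=-1)
```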
@@ -901,16 +901,16 @@ <h2 id="simplification" class="anchor">Simplification </h2>
 <h2 id="terms-to-compute" class="anchor">Terms to compute </h2>

 <p>The first term on the right hand side is a bit more tricky.
-We could use sampling to estimate \( E_{z\sim Q}\left[\log P(X|z) \right] \), but getting a good estimate would require passing many samples of \( z \) through \( f \), which would be expensive.
-Hence, as is standard in stochastic gradient descent, we take one sample of \( z \) and treat \( \log P(X|z) \) for that \( z \) as an approximation of \( E_{z\sim Q}\left[\log P(X|z) \right] \).
-After all, we are already doing stochastic gradient descent over different values of \( X \) sampled from a dataset \( D \).
+We could use sampling to estimate \( E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p_{\boldsymbol{\theta}}(x|h) \right] \), but getting a good estimate would require passing many samples of \( h \) through some function \( f \), which would be expensive.
+Hence, as is standard in stochastic gradient descent, we take one sample of \( h \) and treat \( \log p(x|h) \) for that \( h \) as an approximation of \( E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p(x|h) \right] \).
+After all, we are already doing stochastic gradient descent over different values of \( x \) sampled from a dataset \( X \).
 The full equation we want to optimize is:
 </p>

 $$
 \begin{array}{c}
-E_{X\sim D}\left[\log P(X) - \mathcal{D}\left[Q(z|X)\|P(z|X)\right]\right]=\hspace{16em}\\
-\hspace{10em}E_{X\sim D}\left[E_{z\sim Q}\left[\log P(X|z) \right] - \mathcal{D}\left[Q(z|X)\|P(z)\right]\right].
+E_{x\sim X}\left[\log p(x) - \mathcal{D}\left[q_{\boldsymbol{\phi}}(h|x)\|p(h|x)\right]\right]=\hspace{16em}\\
+\hspace{10em}E_{x\sim X}\left[E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p(x|h) \right] - \mathcal{D}\left[q_{\boldsymbol{\phi}}(h|x)\|p(h)\right]\right].
 \end{array}
 $$
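In symbols, the single-sample approximation described in this hunk replaces the inner expectation with one draw from the encoder:

$$
E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p(x|h) \right] \approx \log p(x|h^{(1)}), \qquad h^{(1)}\sim q_{\boldsymbol{\phi}}(h|x).
$$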

@@ -919,24 +919,24 @@ <h2 id="terms-to-compute" class="anchor">Terms to compute </h2>
 <h2 id="computing-the-gradients" class="anchor">Computing the gradients </h2>

 <p>If we take the gradient of this equation, the gradient symbol can be moved into the expectations.
-Therefore, we can sample a single value of \( X \) and a single value of \( z \) from the distribution \( Q(z|X) \), and compute the gradient of:
+Therefore, we can sample a single value of \( x \) and a single value of \( h \) from the distribution \( q_{\boldsymbol{\phi}}(h|x) \), and compute the gradient of:
 </p>
 $$
 \begin{equation}
-\log P(X|z)-\mathcal{D}\left[Q(z|X)\|P(z)\right].
+\log p(x|h)-\mathcal{D}\left[q_{\boldsymbol{\phi}}(h|x)\|p(h)\right].
 \label{_auto1}
 \end{equation}
 $$

-<p>We can then average the gradient of this function over arbitrarily many samples of \( X \) and \( z \), and the result converges to the gradient.</p>
+<p>We can then average the gradient of this function over arbitrarily many samples of \( x \) and \( h \), and the result converges to the gradient.</p>

 <p>There is, however, a significant problem
-\( E_{z\sim Q}\left[\log P(X|z) \right] \) depends not just on the parameters of \( P \), but also on the parameters of \( Q \).
+\( E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p(x|h) \right] \) depends not just on the parameters of \( p \), but also on the parameters of \( q_{\boldsymbol{\phi}} \).
 </p>

-<p>In order to make VAEs work, it is essential to drive \( Q \) to produce codes for \( X \) that \( P \) can reliably decode. </p>
+<p>In order to make VAEs work, it is essential to drive \( q_{\boldsymbol{\phi}} \) to produce codes for \( x \) that \( p \) can reliably decode. Reparameterizing the sampling step yields the objective:</p>
 $$
-E_{X\sim D}\left[E_{\epsilon\sim\mathcal{N}(0,I)}[\log P(X|z=\mu(X)+\Sigma^{1/2}(X)*\epsilon)]-\mathcal{D}\left[Q(z|X)\|P(z)\right]\right].
+E_{x\sim X}\left[E_{\epsilon\sim\mathcal{N}(0,I)}[\log p(x|h=\mu(x)+\Sigma^{1/2}(x)*\epsilon)]-\mathcal{D}\left[q_{\boldsymbol{\phi}}(h|x)\|p(h)\right]\right].
 $$

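Putting the pieces together: one SGD step minimizes the negative of this bracketed objective, a reconstruction term from a single reparameterized sample plus the closed-form KL. A hedged end-to-end sketch reusing the hypothetical helpers above, assuming a Gaussian decoder likelihood so that \( -\log p(x|h) \) reduces to squared error up to a constant:

```python
import torch

# Hypothetical instances of the Encoder/Decoder sketched earlier.
enc, dec = Encoder(x_dim=784, h_dim=20), Decoder(h_dim=20, x_dim=784)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def train_step(x):
    mu, logvar = enc(x)                     # parameters of q_phi(h|x)
    h = reparameterize(mu, logvar)          # h = mu + sigma * eps, eps ~ N(0, I)
    x_hat = dec(h)                          # decoder output for p_theta(x|h)
    recon = ((x_hat - x) ** 2).sum(dim=-1)  # -log p(x|h) up to a constant
    kl = kl_to_standard_normal(mu, logvar)  # D[q(h|x) || p(h)] in closed form
    loss = (recon + kl).mean()              # minibatch average of the negative ELBO
    opt.zero_grad()
    loss.backward()                         # gradients flow through the reparameterized h
    opt.step()
    return loss.item()
```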

@@ -1318,11 +1318,11 @@ <h2 id="diffusion-models-basics" class="anchor">Diffusion models, basics </h2>
 <h2 id="problems-with-probabilistic-models" class="anchor">Problems with probabilistic models </h2>

 <p>Historically, probabilistic models suffer from a tradeoff between two
-conflicting objectives: \textit{tractability} and
-\textit{flexibility}. Models that are \textit{tractable} can be
+conflicting objectives: <b>tractability</b> and
+<b>flexibility</b>. Models that are <b>tractable</b> can be
 analytically evaluated and easily fit to data (e.g. a Gaussian or
 Laplace). However, these models are unable to aptly describe structure
-in rich datasets. On the other hand, models that are \textit{flexible}
+in rich datasets. On the other hand, models that are <b>flexible</b>
 can be molded to fit structure in arbitrary data. For example, we can
 define models in terms of any (non-negative) function \( \phi(\boldsymbol{x}) \)
 yielding the flexible distribution \( p\left(\boldsymbol{x}\right) =
@@ -1622,7 +1622,7 @@ <h2 id="interpretations" class="anchor">Interpretations </h2>
 <h2 id="the-last-term" class="anchor">The last term </h2>

 <ul>
-<li> \( \mathbb{E}_{q(\boldsymbol{x}_{t-1}, \boldsymbol{x}_{t+1}|\boldsymbol{x}_0)}\left[D_{KL}(q(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1})\vert\vert p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t}|\boldsymbol{x}_{t+1}))\right] \) is a \textit{consistency term}; it endeavors to make the distribution at \( \boldsymbol{x}_t \) consistent, from both forward and backward processes. That is, a denoising step from a noisier image should match the corresponding noising step from a cleaner image, for every intermediate timestep; this is reflected mathematically by the KL Divergence. This term is minimized when we train \( p_{\theta}(\boldsymbol{x}_t|\boldsymbol{x}_{t+1}) \) to match the Gaussian distribution \( q(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}) \).</li>
+<li> \( \mathbb{E}_{q(\boldsymbol{x}_{t-1}, \boldsymbol{x}_{t+1}|\boldsymbol{x}_0)}\left[D_{KL}(q(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1})\vert\vert p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t}|\boldsymbol{x}_{t+1}))\right] \) is a <b>consistency term</b>; it endeavors to make the distribution at \( \boldsymbol{x}_t \) consistent, from both forward and backward processes. That is, a denoising step from a noisier image should match the corresponding noising step from a cleaner image, for every intermediate timestep; this is reflected mathematically by the KL Divergence. This term is minimized when we train \( p_{\boldsymbol{\theta}}(\boldsymbol{x}_t|\boldsymbol{x}_{t+1}) \) to match the Gaussian distribution \( q(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}) \).</li>
 </ul>
 <!-- !split -->
 <h2 id="diffusion-models-part-2-from-url-https-arxiv-org-abs-2208-11970" class="anchor">Diffusion models, part 2, from <a href="https://arxiv.org/abs/2208.11970" target="_self"><tt>https://arxiv.org/abs/2208.11970</tt></a> </h2>
