Commit f60e0bf ("update")
1 parent c8755f5
8 files changed: +280 additions, -280 deletions

doc/pub/week13/html/week13-bs.html

Lines changed: 23 additions & 23 deletions
@@ -705,10 +705,10 @@ <h2 id="analysis" class="anchor">Analysis </h2>
 <h2 id="the-vae" class="anchor">The VAE </h2>

 <p>In the default formulation of the VAE by Kingma and Welling (2015), we directly maximize the ELBO. This
-approach is \textit{variational}, because we optimize for the best
+approach is <b>variational</b>, because we optimize for the best
 \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \) amongst a family of potential posterior
 distributions parameterized by \( \boldsymbol{\phi} \). It is called an
-\textit{autoencoder} because it is reminiscent of a traditional
+<b>autoencoder</b> because it is reminiscent of a traditional
 autoencoder model, where input data is trained to predict itself after
 undergoing an intermediate bottlenecking representation step.
 </p>
@@ -734,11 +734,11 @@ <h2 id="bottlenecking-distribution" class="anchor">Bottlenecking distribution </

 <p>In this case, we learn an intermediate bottlenecking distribution
 \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \) that can be treated as
-an \textit{encoder}; it transforms inputs into a distribution over
+an <b>encoder</b>; it transforms inputs into a distribution over
 possible latents. Simultaneously, we learn a deterministic function
 \( p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h}) \) to convert a given latent vector
 \( \boldsymbol{h} \) into an observation \( \boldsymbol{x} \), which can be interpreted as
-a \textit{decoder}.
+a <b>decoder</b>.
 </p>

 <!-- !split -->
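The encoder/decoder pair described in this hunk maps directly onto two small networks: one that outputs the parameters of \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \), and one deterministic map playing the role of \( p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{h}) \). A minimal sketch, assuming PyTorch and a diagonal covariance; the class names, layer sizes, and dimensions are illustrative, not the lecture's actual implementation:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mu, log sigma^2) of q_phi(h|x)."""
    def __init__(self, x_dim, h_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, h_dim)       # mean of the latent Gaussian
        self.logvar = nn.Linear(hidden, h_dim)   # log-variance (diagonal Sigma)

    def forward(self, x):
        s = self.net(x)
        return self.mu(s), self.logvar(s)

class Decoder(nn.Module):
    """Deterministic map for p_theta(x|h): latent vector h -> reconstruction of x."""
    def __init__(self, h_dim, x_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(h_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, h):
        return self.net(h)
```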
@@ -785,7 +785,7 @@ <h2 id="reparameterization-trick" class="anchor">Reparameterization trick </h2>
 <p>However, a problem arises in this default setup: each \( \boldsymbol{h}^{(l)} \)
 that our loss is computed on is generated by a stochastic sampling
 procedure, which is generally non-differentiable. Fortunately, this
-can be addressed via the \textit{reparameterization trick} when
+can be addressed via the <b>reparameterization trick</b> when
 \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \) is designed to model certain
 distributions, including the multivariate Gaussian.
 </p>
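The trick itself is one line of arithmetic: draw \( \epsilon \sim \mathcal{N}(0, I) \) and set \( \boldsymbol{h} = \mu + \Sigma^{1/2}\epsilon \), so the sample becomes a deterministic, differentiable function of \( \mu \) and \( \Sigma \), with all randomness isolated in \( \epsilon \). A hedged sketch, again assuming PyTorch and a diagonal covariance stored as its log-variance:

```python
import torch

def reparameterize(mu, logvar):
    """Sample h ~ N(mu, diag(exp(logvar))) differentiably: h = mu + sigma * eps."""
    std = torch.exp(0.5 * logvar)  # sigma from the log-variance
    eps = torch.randn_like(std)    # eps ~ N(0, I); the only source of randomness
    return mu + std * eps          # gradients flow through mu and logvar
```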
@@ -856,9 +856,9 @@ <h2 id="after-training" class="anchor">After training </h2>
 <h2 id="setting-up-sgd" class="anchor">Setting up SGD </h2>
 <p>So how can we perform stochastic gradient descent?</p>

-<p>First we need to be a bit more specific about the form that \( Q(\boldsymbol{h}|\boldsymbol{x}) \)
+<p>First we need to be a bit more specific about the form that \( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x}) \)
 will take. The usual choice is to say that
-\( Q(\boldsymbol{h}|\boldsymbol{x})=\mathcal{N}(\boldsymbol{h}|\mu(\boldsymbol{x};\vartheta),\Sigma(;\vartheta)) \), where
+\( q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})=\mathcal{N}(\boldsymbol{h}|\mu(\boldsymbol{x};\vartheta),\Sigma(\boldsymbol{x};\vartheta)) \), where
 \( \mu \) and \( \Sigma \) are arbitrary deterministic functions with
 parameters \( \vartheta \) that can be learned from data (we will omit
 \( \vartheta \) in later equations). In practice, \( \mu \) and \( \Sigma \) are
@@ -873,7 +873,7 @@ <h2 id="more-on-the-sgd" class="anchor">More on the SGD </h2>
 the fact that \( \mu \) and \( \Sigma \) are &quot;encoding&quot; \( \boldsymbol{x} \) into the latent
 space \( \boldsymbol{h} \). The advantages of this choice are computational, as they
 make it clear how to compute the right hand side. The last
-term---\( \mathcal{D}\left[Q(\boldsymbol{h}|\boldsymbol{x})\|p(\boldsymbol{h})\right] \)---is now a KL-divergence
+term---\( \mathcal{D}\left[q_{\boldsymbol{\phi}}(\boldsymbol{h}|\boldsymbol{x})\|p(\boldsymbol{h})\right] \)---is now a KL-divergence
 between two multivariate Gaussian distributions, which can be computed
 in closed form as:
 </p>
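For the common special case of a diagonal-Gaussian encoder and a standard-normal prior \( p(\boldsymbol{h})=\mathcal{N}(0,I) \), this closed form reduces to a sum over latent dimensions; the general two-Gaussian formula in the source file sits outside this diff's context. A sketch of that special case:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """D[ N(mu, diag(exp(logvar))) || N(0, I) ]
    = 0.5 * sum( exp(logvar) + mu^2 - 1 - logvar ) over latent dimensions."""
    return 0.5 * torch.sum(torch.exp(logvar) + mu.pow(2) - 1.0 - logvar, dim=-1)
```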
@@ -901,16 +901,16 @@ <h2 id="simplification" class="anchor">Simplification </h2>
 <h2 id="terms-to-compute" class="anchor">Terms to compute </h2>

 <p>The first term on the right hand side is a bit more tricky.
-We could use sampling to estimate \( E_{z\sim Q}\left[\log P(X|z) \right] \), but getting a good estimate would require passing many samples of \( z \) through \( f \), which would be expensive.
-Hence, as is standard in stochastic gradient descent, we take one sample of \( z \) and treat \( \log P(X|z) \) for that \( z \) as an approximation of \( E_{z\sim Q}\left[\log P(X|z) \right] \).
-After all, we are already doing stochastic gradient descent over different values of \( X \) sampled from a dataset \( D \).
+We could use sampling to estimate \( E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p_{\boldsymbol{\theta}}(x|h) \right] \), but getting a good estimate would require passing many samples of \( h \) through some function \( f \), which would be expensive.
+Hence, as is standard in stochastic gradient descent, we take one sample of \( h \) and treat \( \log p(x|h) \) for that \( h \) as an approximation of \( E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p(x|h) \right] \).
+After all, we are already doing stochastic gradient descent over different values of \( x \) sampled from a dataset \( X \).
 The full equation we want to optimize is:
 </p>

 $$
 \begin{array}{c}
-E_{X\sim D}\left[\log P(X) - \mathcal{D}\left[Q(z|X)\|P(z|X)\right]\right]=\hspace{16em}\\
-\hspace{10em}E_{X\sim D}\left[E_{z\sim Q}\left[\log P(X|z) \right] - \mathcal{D}\left[Q(z|X)\|P(z)\right]\right].
+E_{x\sim X}\left[\log p(x) - \mathcal{D}\left[q_{\boldsymbol{\phi}}(h|x)\|p(h|x)\right]\right]=\hspace{16em}\\
+\hspace{10em}E_{x\sim X}\left[E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p(x|h) \right] - \mathcal{D}\left[q_{\boldsymbol{\phi}}(h|x)\|p(h)\right]\right].
 \end{array}
 $$
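In symbols, the single-sample approximation described in this hunk replaces the inner expectation with one draw from the encoder:

$$
E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p(x|h) \right] \approx \log p(x|h^{(1)}), \qquad h^{(1)}\sim q_{\boldsymbol{\phi}}(h|x).
$$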

@@ -919,24 +919,24 @@ <h2 id="terms-to-compute" class="anchor">Terms to compute </h2>
 <h2 id="computing-the-gradients" class="anchor">Computing the gradients </h2>

 <p>If we take the gradient of this equation, the gradient symbol can be moved into the expectations.
-Therefore, we can sample a single value of \( X \) and a single value of \( z \) from the distribution \( Q(z|X) \), and compute the gradient of:
+Therefore, we can sample a single value of \( x \) and a single value of \( h \) from the distribution \( q_{\boldsymbol{\phi}}(h|x) \), and compute the gradient of:
 </p>
 $$
 \begin{equation}
-\log P(X|z)-\mathcal{D}\left[Q(z|X)\|P(z)\right].
+\log p(x|h)-\mathcal{D}\left[q_{\boldsymbol{\phi}}(h|x)\|p(h)\right].
 \label{_auto1}
 \end{equation}
 $$

-<p>We can then average the gradient of this function over arbitrarily many samples of \( X \) and \( z \), and the result converges to the gradient.</p>
+<p>We can then average the gradient of this function over arbitrarily many samples of \( x \) and \( h \), and the result converges to the gradient.</p>

 <p>There is, however, a significant problem
-\( E_{z\sim Q}\left[\log P(X|z) \right] \) depends not just on the parameters of \( P \), but also on the parameters of \( Q \).
+\( E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p(x|h) \right] \) depends not just on the parameters of \( p \), but also on the parameters of \( q_{\boldsymbol{\phi}} \).
 </p>

-<p>In order to make VAEs work, it is essential to drive \( Q \) to produce codes for \( X \) that \( P \) can reliably decode. </p>
+<p>In order to make VAEs work, it is essential to drive \( q_{\boldsymbol{\phi}} \) to produce codes for \( x \) that \( p \) can reliably decode. Reparameterizing the sampling step yields the objective:</p>
 $$
-E_{X\sim D}\left[E_{\epsilon\sim\mathcal{N}(0,I)}[\log P(X|z=\mu(X)+\Sigma^{1/2}(X)*\epsilon)]-\mathcal{D}\left[Q(z|X)\|P(z)\right]\right].
+E_{x\sim X}\left[E_{\epsilon\sim\mathcal{N}(0,I)}[\log p(x|h=\mu(x)+\Sigma^{1/2}(x)*\epsilon)]-\mathcal{D}\left[q_{\boldsymbol{\phi}}(h|x)\|p(h)\right]\right].
 $$

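Putting the pieces together: one SGD step minimizes the negative of this bracketed objective, a reconstruction term from a single reparameterized sample plus the closed-form KL. A hedged end-to-end sketch reusing the hypothetical helpers above, assuming a Gaussian decoder likelihood so that \( -\log p(x|h) \) reduces to squared error up to a constant:

```python
import torch

# Hypothetical instances of the Encoder/Decoder sketched earlier.
enc, dec = Encoder(x_dim=784, h_dim=20), Decoder(h_dim=20, x_dim=784)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def train_step(x):
    mu, logvar = enc(x)                     # parameters of q_phi(h|x)
    h = reparameterize(mu, logvar)          # h = mu + sigma * eps, eps ~ N(0, I)
    x_hat = dec(h)                          # decoder output for p_theta(x|h)
    recon = ((x_hat - x) ** 2).sum(dim=-1)  # -log p(x|h) up to a constant
    kl = kl_to_standard_normal(mu, logvar)  # D[q(h|x) || p(h)] in closed form
    loss = (recon + kl).mean()              # minibatch average of the negative ELBO
    opt.zero_grad()
    loss.backward()                         # gradients flow through the reparameterized h
    opt.step()
    return loss.item()
```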

@@ -1318,11 +1318,11 @@ <h2 id="diffusion-models-basics" class="anchor">Diffusion models, basics </h2>
 <h2 id="problems-with-probabilistic-models" class="anchor">Problems with probabilistic models </h2>

 <p>Historically, probabilistic models suffer from a tradeoff between two
-conflicting objectives: \textit{tractability} and
-\textit{flexibility}. Models that are \textit{tractable} can be
+conflicting objectives: <b>tractability</b> and
+<b>flexibility</b>. Models that are <b>tractable</b> can be
 analytically evaluated and easily fit to data (e.g. a Gaussian or
 Laplace). However, these models are unable to aptly describe structure
-in rich datasets. On the other hand, models that are \textit{flexible}
+in rich datasets. On the other hand, models that are <b>flexible</b>
 can be molded to fit structure in arbitrary data. For example, we can
 define models in terms of any (non-negative) function \( \phi(\boldsymbol{x}) \)
 yielding the flexible distribution \( p\left(\boldsymbol{x}\right) =
@@ -1622,7 +1622,7 @@ <h2 id="interpretations" class="anchor">Interpretations </h2>
 <h2 id="the-last-term" class="anchor">The last term </h2>

 <ul>
-<li> \( \mathbb{E}_{q(\boldsymbol{x}_{t-1}, \boldsymbol{x}_{t+1}|\boldsymbol{x}_0)}\left[D_{KL}(q(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1})\vert\vert p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t}|\boldsymbol{x}_{t+1}))\right] \) is a \textit{consistency term}; it endeavors to make the distribution at \( \boldsymbol{x}_t \) consistent, from both forward and backward processes. That is, a denoising step from a noisier image should match the corresponding noising step from a cleaner image, for every intermediate timestep; this is reflected mathematically by the KL Divergence. This term is minimized when we train \( p_{\theta}(\boldsymbol{x}_t|\boldsymbol{x}_{t+1}) \) to match the Gaussian distribution \( q(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}) \).</li>
+<li> \( \mathbb{E}_{q(\boldsymbol{x}_{t-1}, \boldsymbol{x}_{t+1}|\boldsymbol{x}_0)}\left[D_{KL}(q(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1})\vert\vert p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t}|\boldsymbol{x}_{t+1}))\right] \) is a <b>consistency term</b>; it endeavors to make the distribution at \( \boldsymbol{x}_t \) consistent, from both forward and backward processes. That is, a denoising step from a noisier image should match the corresponding noising step from a cleaner image, for every intermediate timestep; this is reflected mathematically by the KL Divergence. This term is minimized when we train \( p_{\boldsymbol{\theta}}(\boldsymbol{x}_t|\boldsymbol{x}_{t+1}) \) to match the Gaussian distribution \( q(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}) \).</li>
 </ul>
 <!-- !split -->
 <h2 id="diffusion-models-part-2-from-url-https-arxiv-org-abs-2208-11970" class="anchor">Diffusion models, part 2, from <a href="https://arxiv.org/abs/2208.11970" target="_self"><tt>https://arxiv.org/abs/2208.11970</tt></a> </h2>
