<h2id="terms-to-compute" class="anchor">Terms to compute </h2>
<p>The first term on the right-hand side is a bit trickier.
We could use sampling to estimate \( E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p_{\Theta}(x|h) \right] \), but getting a good estimate would require passing many samples of \( h \) through the decoder \( f \), which would be expensive.
Hence, as is standard in stochastic gradient descent, we take one sample of \( h \) and treat \( \log p_{\Theta}(x|h) \) for that \( h \) as an approximation of \( E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p_{\Theta}(x|h) \right] \).
After all, we are already doing stochastic gradient descent over different values of \( x \) sampled from a dataset \( X \).
</p>
<h2id="computing-the-gradients" class="anchor">Computing the gradients </h2>
<p>If we take the gradient of this equation, the gradient symbol can be moved inside the expectations.
Therefore, we can sample a single value of \( x \) from the dataset and a single value of \( h \) from the distribution \( q_{\boldsymbol{\phi}}(h|x) \), and compute the gradient of
\[
\log p_{\Theta}(x|h) - D_{KL}\left(q_{\boldsymbol{\phi}}(h|x)\,\vert\vert\, p(h)\right).
\]
</p>
<p>We can then average the gradient of this function over arbitrarily many samples of \( x \) and \( h \), and the result converges to the gradient of the full expectation.</p>
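<p>A minimal sketch of this averaging, assuming PyTorch and a toy scalar decoder with a single stand-in parameter <code>theta</code> (none of these names come from the notes), is the following; each draw of \( x \) and \( h \) gives one noisy single-sample gradient, and their average approaches the gradient of the expectation:</p>

<pre><code class="language-python">
import torch

theta = torch.tensor(0.5, requires_grad=True)  # stand-in parameter of the decoder p(x|h)

def one_sample_grad(x):
    # Gradient with respect to theta of log p(x|h) for a single draw h ~ q(h|x).
    h = torch.randn(())                        # one code sampled from a toy q(h|x)
    log_p = -0.5 * (x - theta * h) ** 2        # Gaussian decoder log-likelihood, up to a constant
    (grad_theta,) = torch.autograd.grad(log_p, theta)
    return grad_theta

# Average the single-sample gradient over many draws of x and h.
grads = torch.stack([one_sample_grad(torch.randn(())) for _ in range(2000)])
print(grads.mean())  # approaches the gradient of the full expectation as the number of draws grows
</code></pre>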
<p>There is, however, a significant problem:
\( E_{h\sim q_{\boldsymbol{\phi}}}\left[\log p_{\Theta}(x|h) \right] \) depends not just on the parameters \( \Theta \) of \( p \), but also on the parameters \( \boldsymbol{\phi} \) of \( q_{\boldsymbol{\phi}} \).
</p>
<p>In order to make VAEs work, it is essential to drive \( q_{\boldsymbol{\phi}} \) to produce codes for \( x \) that \( p_{\Theta} \) can reliably decode.</p>
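<p>One standard way to let the gradient of the single-sample estimate reach the parameters of \( q_{\boldsymbol{\phi}} \) as well is to write the sampled code as a deterministic function of \( \boldsymbol{\phi} \) plus independent noise (the reparameterization trick). The sketch below is a toy, scalar illustration assuming PyTorch; the names <code>phi_mu</code>, <code>phi_logstd</code> and <code>theta</code> are placeholders, not notation from these notes.</p>

<pre><code class="language-python">
import torch

# Toy scalar parameters: phi_* for the encoder q(h|x), theta for the decoder p(x|h).
phi_mu = torch.tensor(0.1, requires_grad=True)
phi_logstd = torch.tensor(-1.0, requires_grad=True)
theta = torch.tensor(0.8, requires_grad=True)

x = torch.tensor(1.5)                       # one data point

# Reparameterize: h is a deterministic function of phi plus independent noise eps.
eps = torch.randn(())
h = phi_mu + torch.exp(phi_logstd) * eps    # h ~ q(h|x) = N(phi_mu, exp(phi_logstd)^2)

log_p = -0.5 * (x - theta * h) ** 2         # log p(x|h) up to a constant (Gaussian decoder)

g_mu, g_logstd, g_theta = torch.autograd.grad(log_p, [phi_mu, phi_logstd, theta])
# Gradients now flow into both the decoder parameter and the encoder parameters,
# which is what lets training push q toward codes that p can decode.
print(g_mu, g_logstd, g_theta)
</code></pre>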
<h2id="the-last-term" class="anchor">The last term </h2>
<ul>
<li> \( \mathbb{E}_{q(\boldsymbol{x}_{t-1}, \boldsymbol{x}_{t+1}|\boldsymbol{x}_0)}\left[D_{KL}(q(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1})\vert\vert p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t}|\boldsymbol{x}_{t+1}))\right] \) is a <b>consistency term</b>; it endeavors to make the distribution at \( \boldsymbol{x}_t \) consistent from both the forward and the backward process. That is, a denoising step from a noisier image should match the corresponding noising step from a cleaner image, for every intermediate timestep; this is expressed mathematically by the KL divergence (evaluated in closed form in the sketch after this list). This term is minimized when we train \( p_{\boldsymbol{\theta}}(\boldsymbol{x}_t|\boldsymbol{x}_{t+1}) \) to match the Gaussian distribution \( q(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}) \).</li>
</ul>
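<p>Since both distributions in the consistency term are Gaussian, the KL divergence inside the expectation has a closed form. The sketch below (assuming PyTorch, a standard Gaussian forward step \( q(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,\boldsymbol{x}_{t-1}, \beta_t \boldsymbol{I}) \) with noise level \( \beta_t \), and placeholder values for the reverse model's predicted mean and standard deviation) evaluates that divergence for one intermediate timestep:</p>

<pre><code class="language-python">
import torch

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    # KL( N(mu_q, diag(std_q^2)) || N(mu_p, diag(std_p^2)) ), summed over dimensions.
    var_q, var_p = std_q**2, std_p**2
    return torch.sum(torch.log(std_p / std_q) + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5)

# Toy setup: a flattened "image" x_{t-1} and a noise level beta_t (assumed schedule value).
x_prev = torch.randn(10)
beta_t = 0.02
mu_q = torch.sqrt(torch.tensor(1.0 - beta_t)) * x_prev     # mean of q(x_t | x_{t-1})
std_q = torch.sqrt(torch.tensor(beta_t)) * torch.ones(10)  # std of q(x_t | x_{t-1})

# Placeholders for what the reverse model p_theta(x_t | x_{t+1}) would predict.
mu_theta = torch.randn(10)
std_theta = 0.15 * torch.ones(10)

print(kl_diag_gaussians(mu_q, std_q, mu_theta, std_theta))
</code></pre>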
<!-- !split -->
<h2id="diffusion-models-part-2-from-url-https-arxiv-org-abs-2208-11970" class="anchor">Diffusion models, part 2, from <ahref="https://arxiv.org/abs/2208.11970" target="_self"><tt>https://arxiv.org/abs/2208.11970</tt></a></h2>