
Commit 58eb26c

Merge branch 'master' into master
2 parents: f084d04 + 8a59855

16 files changed: +1091 -3 lines changed

chapter_15/1_background.tex (+248)
@@ -0,0 +1,248 @@
% Contributors: Suman Mulumudi, Jake Lee

\section{Background on Autoencoders and a Few Flavors}

Autoencoders are neural networks which perform dimensionality reduction in an unsupervised manner by compressing their inputs into a (generally lower-dimensional) latent space while mandating that the original input can be reconstructed. This turns out to be an extremely powerful architecture, enabling both the discovery of well-structured latent spaces and the separation of signal from noise.

Formally, suppose an input space $X$, an encoder $f(x; \theta_f): X \to H$, and a decoder $g(h; \theta_g): H \to X$, where $H$ is a latent space. Further, suppose a reconstruction loss $l(x, x'): X \times X \to \mathbb{R}$. The autoencoder is then optimized as:

\begin{align*}
\theta_f^*, \theta_g^* = \argmin_{\theta_f, \theta_g} \sum_i l(x_i, g(f(x_i)))
\end{align*}

That is, the encoder $f$ transforms $x$ into a latent representation $h$, which is then transformed by $g$ back into the original domain. Generally, $H$ will be chosen to be of lower dimension than $X$ in order to facilitate dimensionality reduction, and the use of a reconstruction loss forces a representation in $H$ to retain information about the original input. Encoder and decoder functions can be as simple as linear layers, or as complex as deep neural networks with non-linear activations in order to learn richer or more abstract representations. Optimization of autoencoders is generally done by stochastic gradient descent.

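To make this concrete, the following is a minimal sketch of such an autoencoder in PyTorch (the framework, layer sizes, and training details are illustrative assumptions, not prescribed above):

\begin{verbatim}
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """A small non-linear autoencoder: the encoder f maps x to a
    lower-dimensional latent h, the decoder g reconstructs x."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(          # f(x; theta_f): X -> H
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(          # g(h; theta_g): H -> X
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Optimize the reconstruction loss l(x, g(f(x))) -- here mean
# squared error -- by stochastic gradient descent.
model = Autoencoder()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
x = torch.rand(64, 784)          # stand-in batch of inputs
for _ in range(10):
    loss = loss_fn(model(x), x)
    opt.zero_grad()
    loss.backward()
    opt.step()
\end{verbatim}
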
To better understand the process of autoencoding, it may be helpful to view $f(x)$ as projecting points onto a manifold $H$, and $g(h)$ as re-projecting from the manifold back into the original space; with this conceptualization, training an autoencoder is the process of finding a low-dimensional manifold which best represents the data in $X$. Indeed, when the encoder and decoder are linear, the optimal autoencoder recovers the PCA subspace, which we know to be the optimal linear manifold for dimensionality reduction \cite{alain2014regularized}.

In denoising applications, the full autoencoder $g(f(x))$ is used for downstream tasks. In generative applications, the decoder $g(h)$ is used in downstream tasks. In dimensionality reduction and re-representation applications, the latent space representation $h = f(x)$ is useful for downstream tasks. These are discussed in more detail below.

\subsection{Regularized Autoencoders}

Regularized autoencoders apply regularization to the latent space, or introduce noise into the input, to better constrain the geometry of the latent space.

One common regularized autoencoder is the Contractive Autoencoder (CAE). CAEs modify the autoencoder loss to:

\begin{align*}
L_{CAE} = \sum_i \left( l(x_i, \hat{x}_i) + \lambda \left\| \frac{\partial f(x_i)}{\partial x} \right\|^2_F \right)
\end{align*}

This additional term regularizes the encoder, adding robustness to perturbations of the input space. In general, an autoencoder must learn representations which both differentiate datapoints and allow for reconstruction, so a strong contractive term (i.e., large $\lambda$) encourages the autoencoder to be insensitive to directions in the input space which are not descriptive of the data (i.e., orthogonal to the manifold on which the data lie) \cite{alain2014regularized}.

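One way to compute this penalty with automatic differentiation, reusing the PyTorch encoder sketched above (a straightforward but not especially efficient formulation; we accumulate the Jacobian one latent unit at a time):

\begin{verbatim}
import torch

def contractive_penalty(encoder, x):
    """Squared Frobenius norm of the encoder Jacobian df/dx,
    summed over the batch (the lambda-weighted CAE term)."""
    x = x.clone().requires_grad_(True)
    h = encoder(x)
    penalty = 0.0
    # Rows of the Jacobian, one latent unit j at a time; samples
    # are independent, so summing over the batch is safe.
    for j in range(h.shape[1]):
        grad_j, = torch.autograd.grad(
            h[:, j].sum(), x, create_graph=True, retain_graph=True)
        penalty = penalty + (grad_j ** 2).sum()
    return penalty

# loss = loss_fn(model(x), x) + lam * contractive_penalty(model.encoder, x)
\end{verbatim}
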
Another common autoencoder is the denoising autoencoder (DAE). Unlike other autoencoders, DAEs are used to separate data from noise: stochastic noise is introduced into the input, and the uncorrupted output is regressed, with the goal of producing an autoencoder which can denoise an input. That is, if $N(x)$ is a function which introduces stochastic noise, then a DAE has the form:

\begin{align*}
\theta_f^*, \theta_g^* = \argmin_{\theta_f, \theta_g} \sum_i l(x_i, g(f(N(x_i))))
\end{align*}

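A sketch of this corruption step, again in PyTorch; additive Gaussian noise is one common choice of $N(x)$ (masking noise is another), and the noise level here is an illustrative assumption:

\begin{verbatim}
import torch

def corrupt(x, sigma=0.3):
    # N(x): add zero-mean Gaussian noise to the input.
    return x + sigma * torch.randn_like(x)

# DAE training step: encode the corrupted input, but regress
# the reconstruction against the clean target x:
#   loss = loss_fn(model(corrupt(x)), x)
\end{verbatim}
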
Indeed, the DAE can be seen as a form of CAE, as it encourages robustness to small perturbations of the input, but applies regularization over the full encoder-decoder chain rather than just the encoder \cite{alain2014regularized}.

\subsection{Variational Autoencoders}

Variational autoencoders (VAEs) are a generative variant of traditional autoencoders, regularized to create a latent space which can be easily sampled to generate new examples similar to the original data. The primary intuition behind VAEs is to model the dataset as being sampled from a well-structured latent distribution, generally a standard normal distribution $N(0, I)$ over the latent space $H$. Using this distribution as a prior, the VAE encoder then maps each datapoint $x \in X$ to a posterior distribution $N(\vec{\mu}, \Sigma)$ \cite{doersch2016tutorial}.

Formally, we are aiming to maximize the probability of the data $P(X)$ over a latent space $h \in H$:

\begin{align*}
P(X) = \int P(X \mid h) P(h) \, dh
\end{align*}

$P(h)$ is a model of the prior (which we will constrain to $N(0, I)$), and $P(X \mid h)$ models the likelihood of the data given a latent (which we will also constrain to a normal distribution); the VAE optimizes this model such that the probability of all the data in the dataset is maximized. Optimizing this directly is generally computationally intractable because of the size of the latent space, but by observing that most latent variables $h$ will not produce a given datapoint $x$ (i.e., each datapoint is mapped to a small region of the latent space), we can re-formulate the problem such that we sample a more limited region of the latent space. Namely, we re-write the optimization in terms of the KL-divergence $D_{KL}$, introducing an additional probability distribution $Q$:

\begin{align*}
\log P(X) - D_{KL}[Q(h \mid X) \,\|\, P(h \mid X)] \\
= E_{h \sim Q} [\log P(X \mid h)] - D_{KL}[Q(h \mid X) \,\|\, P(h)]
\end{align*}

(A formal derivation of this can be found in \cite{doersch2016tutorial}; the derivation is relatively simple, but is not in itself illustrative and is therefore omitted here.) In this formulation, the distribution $Q(h \mid X)$ is modeled by the encoder, and the distribution $P(X \mid h)$ is modeled by the decoder. Additionally, re-writing the optimization in terms of the KL-divergence reveals an important intuition: the KL-divergence measures the dissimilarity of two probability distributions, so minimizing a KL-divergence term forces the two distributions to be similar.

Intuitively, $E_{h \sim Q} [\log P(X \mid h)]$ maximizes the probability of the data $X$ given $h \sim Q$, as required by the original optimization problem. $D_{KL}[Q(h \mid X) \,\|\, P(h)]$ forces the encoded latent space distribution for the data to be similar to the prior distribution, making it a reasonable distribution over which to sample. We also note that the $D_{KL}[Q(h \mid X) \,\|\, P(h \mid X)]$ term vanishes when $Q$ is sufficiently high-capacity (i.e., the approximate posterior $Q(h \mid X)$ will naturally tend toward the true posterior $P(h \mid X)$ during training), leaving a maximization over $\log P(X)$ as desired.

VAEs model this by making two changes to the standard autoencoder: encoding each datapoint to a multivariate Gaussian distribution rather than a single point (i.e., the posterior), and applying a KL-divergence loss to the resulting distribution (i.e., enforcing the prior). Formally, the encoder is modified to output the $\vec{\mu}_i$ and $\Sigma_i$ of a distribution, and during training the decoder is fed a vector from the latent space sampled from the distribution $N(\vec{\mu}_i, \Sigma_i)$. That is, the encoder and decoder are re-formulated as:

\begin{align*}
&f(x_i; \theta_f) = \vec{\mu}_i, \Sigma_i \\
&\gamma \sim N(0, I) \\
&g(\vec{\mu}_i + \Sigma_i^{1/2} \gamma; \theta_g) = \hat{x}_i
\end{align*}

And the loss is reformulated as:

\begin{align*}
L_{VAE}(x_i, \hat{x}_i) = l(x_i, \hat{x}_i) + D_{KL}[N(\vec{\mu}_i, \Sigma_i) \,\|\, N(0, I)]
\end{align*}

where $l$ is again the reconstruction loss. Intuitively, the KL-divergence term enforces the prior distribution, prevents each datapoint from collapsing to a single point representation in the latent space, and models $D_{KL}[Q(h \mid X) \,\|\, P(h)]$. Further, computing the reconstruction loss over a single point sampled from the predicted distribution $N(\vec{\mu}_i, \Sigma_i)$ serves as a tractable approximation of $E_{h \sim Q} [\log P(X \mid h)]$ to which one can apply stochastic gradient descent. (Note: $\Sigma$ is also often constrained to be diagonal to ease computation of the KL-divergence.)

Once such an autoencoder is trained, the latent space can be sampled to generate new outputs from the original distribution by drawing vectors $h \sim N(0, I)$ and feeding these values $h$ to the decoder $g(h)$.

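The following is a minimal sketch of these two changes in PyTorch, assuming a diagonal $\Sigma$ parameterized by its log-variance (a common convention, not prescribed above); the closed-form KL expression below is specific to that diagonal Gaussian case:

\begin{verbatim}
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Sketch of a VAE with a diagonal-covariance posterior."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)       # outputs mu_i
        self.log_var = nn.Linear(128, latent_dim)  # log of diag(Sigma_i)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim))

    def forward(self, x):
        e = self.enc(x)
        mu, log_var = self.mu(e), self.log_var(e)
        # Reparameterization: h = mu + Sigma^{1/2} gamma, gamma ~ N(0, I)
        gamma = torch.randn_like(mu)
        h = mu + torch.exp(0.5 * log_var) * gamma
        return self.dec(h), mu, log_var

def vae_loss(x, x_hat, mu, log_var):
    recon = ((x_hat - x) ** 2).sum()    # reconstruction loss l
    # Closed-form D_KL[N(mu, Sigma) || N(0, I)] for diagonal Sigma
    kl = 0.5 * (mu ** 2 + log_var.exp() - 1 - log_var).sum()
    return recon + kl

# Generation after training: sample h ~ N(0, I), then decode.
# samples = vae.dec(torch.randn(16, 32))
\end{verbatim}
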
\subsection{Conditional VAE}

While a VAE can generate outputs from the overall original distribution by feeding latent space samples to the decoder, in certain cases we may need more control over this data generation process. For example, if trained on a dataset of digit images, we may want to generate images of a specific digit instead of any digit from the overall distribution. To accomplish this, we turn to the Conditional VAE.

Whereas a standard VAE's encoder and decoder condition only on the input space and latent space respectively, a Conditional VAE conditions both on an additional variable $c$. That is, the encoder and decoder are re-formulated as:

\begin{align*}
&f(x_i, c; \theta_f) = \vec{\mu}_i, \Sigma_i \\
&\gamma \sim N(0, I) \\
&g(\vec{\mu}_i + \Sigma_i^{1/2} \gamma, c; \theta_g) = \hat{x}_i
\end{align*}

And the optimization becomes similarly conditioned on $c$. This conditioning is accomplished simply by concatenating $c$ to the input vector and latent vector for the encoder and decoder, respectively. Since the latent variable is now distributed conditioned on $c$, we can feed the decoder a randomly sampled latent space vector together with a chosen $c$ to generate from the specified distribution.

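The concatenation itself is a one-line change on each side of the network; a sketch with hypothetical shapes (flattened $28 \times 28$ images, a one-hot digit label as $c$, and the 32-dimensional latent used earlier):

\begin{verbatim}
import torch

x = torch.rand(64, 784)                          # batch of inputs
c = torch.eye(10)[torch.randint(0, 10, (64,))]   # one-hot labels

enc_in = torch.cat([x, c], dim=1)   # encoder input: [x; c]
h = torch.randn(64, 32)             # latent (reparameterized during
                                    # training; h ~ N(0, I) when
                                    # generating)
dec_in = torch.cat([h, c], dim=1)   # decoder input: [h; c]
\end{verbatim}
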
\subsection{$\beta$-VAE}

While a VAE will learn a latent representation ideal for decoding, there is no guarantee on the \textit{disentanglement} of the learned features. If each variable in a latent representation is sensitive to changes in a single generative factor, while being relatively invariant to changes in other factors, then the representation is referred to as \textit{disentangled}. We introduce the $\beta$-VAE, which encourages the learning of such representations \cite{higgins2017beta}. This is particularly useful for interpretability of the latent space, as well as fine-grained manipulation of generated outputs. For example, a $\beta$-VAE trained on images of faces may be able to modify a single factor such as hair color or mouth shape in a generated image by modifying a single value in the latent space representation.

To achieve this during training, a single hyperparameter $\beta$ is introduced. When $\beta = 1$, the $\beta$-VAE behaves identically to a standard VAE. When $\beta > 1$, however, the capacity of the latent information channel is limited, and the network is forced to learn a more efficient latent representation that is hopefully more disentangled. The loss is defined as:

\begin{align*}
L_{\beta\text{-VAE}}(x_i, \hat{x}_i) = l(x_i, \hat{x}_i) + \beta D_{KL}[N(\vec{\mu}_i, \Sigma_i) \,\|\, N(0, I)]
\end{align*}

This simple addition of the $\beta$ coefficient leads to a more disentangled latent representation, but may come at a cost in reconstruction fidelity. A $\beta$ value must therefore be found which best balances the disentanglement of the latent space against the quality of decoding.
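
Relative to the VAE loss sketched earlier, the change is a single coefficient; $\beta = 4$ below is an illustrative value only:

\begin{verbatim}
def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    # VAE loss with the KL term scaled by beta; beta > 1 encourages
    # disentanglement at some cost in reconstruction fidelity.
    recon = ((x_hat - x) ** 2).sum()
    kl = 0.5 * (mu ** 2 + log_var.exp() - 1 - log_var).sum()
    return recon + beta * kl
\end{verbatim}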
