% Contributors: Suman Mulumudi, Jake Lee

\section{Background on Autoencoders and A Few Flavors}

Autoencoders are neural networks which perform dimensionality
reduction in an unsupervised manner by compressing their inputs into
a (generally lower-dimensional) latent space while mandating that the
original input can be reconstructed. This turns out to be an
extremely powerful architecture, enabling both the discovery of
well-structured latent spaces and the separation of signal from
noise.

Formally, suppose an input space $X$, an encoder $f(x ; \theta_f): X
\to H$, and a decoder $g(h ; \theta_g): H \to X$, where $H$ is a
latent space. Further, suppose a regressive reconstruction loss
$l(x, x'): X \times X \to \mathbb{R}$. The autoencoder is then
optimized as:

\begin{align*}
  \theta_f^*, \theta_g^* = \argmin_{\theta_f, \theta_g} \sum_i l(x_i, g(f(x_i; \theta_f); \theta_g))
\end{align*}

That is, the encoder $f$ transforms $x$ into a latent representation
$h$, which is then transformed by $g$ back into the original domain.
Generally, $H$ will be chosen to be of lower dimension than $X$ in
order to facilitate dimensionality reduction, and the use of a
reconstruction loss forces a representation in $H$ to retain
information about the original input. Encoder and decoder functions
can be as simple as linear layers, or as complex as deep neural
networks with non-linear activations in order to learn richer or more
abstract representations. Optimization of autoencoders is generally
done by stochastic gradient descent.

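As a concrete illustration, the following is a minimal PyTorch sketch
of this setup with small fully-connected networks for $f$ and $g$;
the layer widths, the mean-squared-error reconstruction loss, the
optimizer, and the stand-in \texttt{data\_loader} are illustrative
assumptions rather than prescribed choices.

\begin{verbatim}
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder f: X -> H and decoder g: H -> X, here small MLPs.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim))

    def forward(self, x):
        h = self.encoder(x)      # latent representation h = f(x)
        return self.decoder(h)   # reconstruction x_hat = g(h)

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in data: replace with a real dataset loader.
data_loader = [torch.randn(64, 784) for _ in range(10)]

for x in data_loader:
    x_hat = model(x)
    loss = loss_fn(x_hat, x)     # l(x, g(f(x)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
\end{verbatim}
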
To better understand the process of autoencoding, it may be helpful
to view $f(x)$ as projecting points onto a manifold $H$, and $g(h)$
as re-projecting from the manifold back into the original space; with
this conceptualization, training an autoencoder is the process of
finding a low-dimensional manifold which can best represent the data
in $X$. Indeed, when the encoder and decoder are linear, the optimal
autoencoder recovers the PCA subspace, which we know to be the
optimal linear manifold for dimensionality reduction
\cite{alain2014regularized}.

In denoising applications, the full autoencoder $g(f(x))$ is used for
downstream tasks; in generative applications, it is the decoder
$g(h)$ that is used; and in dimensionality reduction and
re-representation applications, it is the latent representation
$h = f(x)$ that is useful. These are discussed in more detail below.

\subsection{Regularized Autoencoders}

Regularized autoencoders better constrain the geometry of the latent
space, either by adding an explicit regularization term or by
introducing noise into the input.

One common regularized autoencoder is the Contractive Autoencoder
(CAE). CAEs modify the autoencoder loss to:

\begin{align*}
  L_{CAE}(x, \hat{x}) = \sum_i \left ( l(x_i, \hat{x}_i) + \lambda \left \| \frac{\partial f(x_i)}{\partial x} \right \|^2_F \right )
\end{align*}

This additional term regularizes the encoder and adds robustness to
perturbations of the input space. In general, an autoencoder must
learn representations which both differentiate datapoints and allow
for reconstruction, so a strong contractive loss term (i.e. large
$\lambda$) encourages the encoder to become insensitive to directions
in the input space which are not descriptive of the data (i.e.
directions orthogonal to the manifold on which the data lie)
\cite{alain2014regularized}.

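To make the penalty concrete, the following sketch computes the
squared Frobenius norm of the encoder Jacobian with autograd, one
latent dimension at a time; it assumes the \texttt{model} and
\texttt{loss\_fn} objects from the earlier snippet and is only one of
several ways the penalty can be implemented.

\begin{verbatim}
import torch

def contractive_penalty(encoder, x):
    # Squared Frobenius norm of the Jacobian dh/dx, averaged over the batch.
    x = x.clone().requires_grad_(True)
    h = encoder(x)                      # shape: (batch, latent_dim)
    penalty = 0.0
    for j in range(h.shape[1]):
        # The gradient of the j-th latent unit w.r.t. the input gives one
        # row of the Jacobian for every example in the batch.
        grads = torch.autograd.grad(h[:, j].sum(), x,
                                    retain_graph=True, create_graph=True)[0]
        penalty = penalty + (grads ** 2).sum()
    return penalty / x.shape[0]

# CAE objective for a batch x: reconstruction loss plus the weighted penalty.
# loss = loss_fn(model(x), x) + lam * contractive_penalty(model.encoder, x)
\end{verbatim}
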
Another common autoencoder is the denoising autoencoder (DAE). Unlike
other autoencoders, DAEs are used to separate data from noise: they
introduce stochastic noise into the input and regress the uncorrupted
original, with the goal of producing an autoencoder which can denoise
an input. That is, if $N(x)$ is a function which introduces
stochastic noise, then a DAE is trained as:

\begin{align*}
  \theta_f^*, \theta_g^* = \argmin_{\theta_f, \theta_g} \sum_i l(x_i, g(f(N(x_i); \theta_f); \theta_g))
\end{align*}

Indeed, the DAE can actually be seen as a form of a CAE, as it
encourages robustness to small perturbations of the input, but
applies regularization over the full encoder-decoder chain rather
than just the encoder \cite{alain2014regularized}.

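A corruption function $N(x)$ and the corresponding training step
might look as follows; this sketch assumes additive Gaussian noise
(masking or salt-and-pepper corruption are equally common) and reuses
the \texttt{model}, \texttt{loss\_fn}, \texttt{optimizer}, and
\texttt{data\_loader} objects from the first snippet.

\begin{verbatim}
import torch

def corrupt(x, noise_std=0.1):
    # N(x): additive Gaussian corruption of the input.
    return x + noise_std * torch.randn_like(x)

# Denoising objective: reconstruct the clean x from the corrupted input.
for x in data_loader:
    x_hat = model(corrupt(x))
    loss = loss_fn(x_hat, x)     # compare against the *uncorrupted* x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
\end{verbatim}
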
\subsection{Variational Autoencoders}

Variational autoencoders (VAEs) are a generative variant of
traditional autoencoders, regularized to create a latent space which
can be easily sampled to generate new examples similar to the
original data. The primary intuition behind VAEs is to model the
dataset as being sampled from a well-structured latent distribution,
generally a standard normal distribution $h \sim N(0,I)$ over the
latent space $H$. Using this distribution as a prior, the VAE encoder
then maps each datapoint $x \in X$ to a posterior distribution
$N(\vec{\mu}, \Sigma)$ \cite{doersch2016tutorial}.

Formally, we aim to maximize the probability of the data $P(X)$,
marginalized over the latent variable $h \in H$:

\begin{align*}
  P(X) = \int P(X \mid h) P(h) dh
\end{align*}

Here $P(h)$ is the prior (which we will constrain to $N(0,I)$), and
$P(X \mid h)$ is the likelihood of the data given a latent variable
(which we will also constrain to be normal); the VAE attempts to
optimize this likelihood such that the probability of all the data in
the dataset is maximized. Optimizing this directly is generally
computationally intractable because of the size of the latent space,
but by observing that most latent variables $h$ will not produce a
given datapoint $x$ (i.e. each datapoint is mapped to a small region
of the latent space), we can re-formulate the problem such that we
sample a more limited region of the latent space. Namely, we re-write
the optimization in terms of the KL-divergence $D_{KL}$ by
introducing an additional probability distribution $Q$:

\begin{align*}
  &\log P(X) - D_{KL}[Q(h \mid X) \,\|\, P(h \mid X)] \\
  &= E_{h \sim Q} [\log P(X \mid h)] - D_{KL}[Q(h \mid X) \,\|\, P(h)]
\end{align*}

(A formal derivation of this can be found in
\cite{doersch2016tutorial}; the derivation is relatively simple, but
is not in itself illustrative for intuition and is therefore omitted
here.) In this formulation, the distribution $Q(h \mid X)$ is modeled
by the encoder, and the distribution $P(X \mid h)$ is modeled by the
decoder. Additionally, re-writing the optimization in terms of the
KL-divergence makes an important intuition apparent: the
KL-divergence measures how much one probability distribution diverges
from another, so minimizing a KL-divergence term forces the two
distributions to be similar.

Intuitively, maximizing $E_{h \sim Q} [\log P(X \mid h)]$ maximizes
the probability of the data $X$ given $h \sim Q$, as required by the
original optimization problem, while minimizing $D_{KL}[Q(h \mid X)
\,\|\, P(h)]$ forces the encoded latent distribution of the data to
be similar to the prior, making the prior a reasonable distribution
over which to sample. We also note that $D_{KL}[Q(h \mid X) \,\|\,
P(h \mid X)]$ vanishes when $Q$ is sufficiently high-capacity (i.e.
when $Q(h \mid X)$ is flexible enough to match the true posterior
$P(h \mid X)$), leaving a maximization of $\log P(X)$ as desired.

VAEs model this by making two changes to the standard autoencoder:
encoding each datapoint to a multivariate Gaussian distribution
rather than a single point (i.e. the posterior), and applying a
KL-divergence loss to the resulting distribution (i.e. enforcing the
prior). Formally, the encoder is modified to output the parameters
$\vec{\mu}_i$ and $\Sigma_i$ of a distribution, and during training
the decoder is fed a vector from the latent space sampled from the
distribution $N(\vec{\mu}_i, \Sigma_i)$. That is, the encoder and
decoder are re-formulated as:

\begin{align*}
  &f(x_i ;\theta_f) = \vec{\mu}_i, \Sigma_i \\
  &\gamma \sim N(0,I) \\
  &g(\vec{\mu}_i + \Sigma_i^{1/2} \gamma; \theta_g) = \hat{x}_i
\end{align*}

And the loss is reformulated as:

\begin{align*}
  L_{VAE}(x_i, \hat{x}_i) = l(x_i, \hat{x}_i) + D_{KL}[N(\vec{\mu}_i, \Sigma_i) \,\|\, N(0,I)]
\end{align*}

Here $l$ is again the reconstruction loss. Intuitively, the
KL-divergence term enforces the prior distribution, prevents each
datapoint from collapsing to a single point representation in the
latent space, and corresponds to the $D_{KL}[Q(h \mid X) \,\|\,
P(h)]$ term above. Further, computing the reconstruction loss on a
single point sampled from the predicted distribution $N(\vec{\mu}_i,
\Sigma_i)$ serves as a tractable, single-sample approximation of
$E_{h \sim Q} [\log P(X \mid h)]$ to which stochastic gradient
descent can be applied; writing the sample as $\vec{\mu}_i +
\Sigma_i^{1/2} \gamma$ with $\gamma \sim N(0,I)$ (the
reparameterization trick) keeps the sampling step differentiable with
respect to the encoder parameters. (Note: $\Sigma_i$ is also often
constrained to be diagonal to ease computation of the KL-divergence.)

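For reference, when $\Sigma_i = \mathrm{diag}(\sigma_{i,1}^2, \dots,
\sigma_{i,d}^2)$, this KL-divergence term has the standard closed
form

\begin{align*}
  D_{KL}[N(\vec{\mu}_i, \Sigma_i) \,\|\, N(0,I)] = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_{i,j}^2 + \sigma_{i,j}^2 - \log \sigma_{i,j}^2 - 1 \right)
\end{align*}

which can be evaluated and differentiated directly.
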
Once such an autoencoder is trained, the latent space can be sampled
to generate new outputs from the original distribution by drawing
vectors $h \sim N(0,I)$ and feeding these values $h$ to the decoder
$g(h)$.%Jake

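Putting the pieces together, a minimal VAE sketch in PyTorch might
look as follows; the architecture sizes and the mean-squared-error
reconstruction term are again illustrative assumptions, with the
encoder predicting $\vec{\mu}_i$ and $\log \sigma_i^2$ for a diagonal
$\Sigma_i$.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)      # mu_i
        self.to_logvar = nn.Linear(128, latent_dim)  # log sigma_i^2 (diagonal Sigma)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim))

    def forward(self, x):
        feats = self.backbone(x)
        mu, logvar = self.to_mu(feats), self.to_logvar(feats)
        gamma = torch.randn_like(mu)                 # gamma ~ N(0, I)
        h = mu + gamma * torch.exp(0.5 * logvar)     # reparameterization trick
        return self.decoder(h), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")    # reconstruction term l
    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I).
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return recon + kl

# Generation after training: sample h ~ N(0, I) and decode.
model = VAE()
samples = model.decoder(torch.randn(16, 32))
\end{verbatim}
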
\subsection{Conditional VAE}

While a VAE could generate outputs from the overall original
distribution by feeding latent space samples to the decoder, in
certain cases it may become necessary to be more specific in this
data generation process. For example, if trained on a dataset of
digit images, we may want to generate images of a specific digit
instead of any digit from the overall distribution. To accomplish
this, we turn to the Conditional VAE.

Whereas the standard VAE encoder and decoder condition only on the
input space and latent space respectively, a Conditional VAE
additionally conditions both on an extra variable $c$. That is, the
encoder and decoder are re-formulated as:

\begin{align*}
  &f(x_i, c ;\theta_f) = \vec{\mu}_i, \Sigma_i \\
  &\gamma \sim N(0,I) \\
  &g(\vec{\mu}_i + \Sigma_i^{1/2} \gamma, c; \theta_g) = \hat{x}_i
\end{align*}

And the optimization becomes similarly conditioned on $c$. This
conditioning is simply accomplished by concatenating $c$ to the input
vector and latent vector for the encoder and decoder, respectively.
Since the latent variable is now distributed conditionally on $c$, we
can feed the decoder a randomly sampled latent vector together with a
chosen $c$ to generate from the corresponding conditional
distribution.

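As a sketch of the mechanics (assuming an encoder that returns
$(\vec{\mu}_i, \log \sigma_i^2)$ and a decoder as in the VAE snippet
above, with their input dimensions enlarged to accommodate $c$):

\begin{verbatim}
import torch

def cvae_forward(encoder, decoder, x, c):
    # Condition by concatenating c to the encoder input ...
    mu, logvar = encoder(torch.cat([x, c], dim=-1))
    gamma = torch.randn_like(mu)
    h = mu + gamma * torch.exp(0.5 * logvar)
    # ... and to the latent vector fed to the decoder.
    return decoder(torch.cat([h, c], dim=-1)), mu, logvar

# Conditional generation: sample h ~ N(0, I), choose c (e.g. a one-hot
# digit label), and decode torch.cat([h, c], dim=-1).
\end{verbatim}
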
\subsection{$\beta$-VAE}

While a VAE will learn a latent representation ideal for decoding,
there is no guarantee on the \textit{disentanglement} of the learned
features. If each variable in a latent representation is sensitive to
changes in a single generative factor, while being relatively
invariant to changes in other factors, then the representation is
referred to as \textit{disentangled}. We introduce the $\beta$-VAE,
which encourages the learning of such representations
\cite{higgins2017beta}. This is particularly useful for
interpretability of the latent space, as well as for fine-grained
manipulation of generated outputs. For example, a $\beta$-VAE trained
on images of faces may be able to modify a single factor such as hair
color or mouth shape in a generated image by modifying a single value
in the latent space representation.

To achieve this during training, a single hyperparameter $\beta$ is
introduced. When $\beta = 1$, the $\beta$-VAE is equivalent to a
standard VAE. When $\beta > 1$, however, the capacity of the latent
information channel is constrained, and the network is forced to
learn a more efficient latent representation that is hopefully more
disentangled. The loss is defined as:

\begin{align*}
  L_{\beta \text{-VAE}}(x_i, \hat{x}_i) = l(x_i, \hat{x}_i) + \beta D_{KL}[N(\vec{\mu}_i, \Sigma_i) \,\|\, N(0,I)]
\end{align*}

This simple addition of the $\beta$ coefficient leads to a more
disentangled latent representation, but may come at the cost of
reconstruction fidelity. Therefore, a $\beta$ value must be chosen
that best balances the disentanglement of the latent space against
the quality of decoding.

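In code, this is a one-line change to the \texttt{vae\_loss} sketched
earlier; the particular value of $\beta$ below is purely
illustrative.

\begin{verbatim}
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return recon + beta * kl    # the KL term is now weighted by beta
\end{verbatim}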