
Coarse-to-fine Perceptual Decomposition With Deep Generative Models

Dissertation project for Artificial Intelligence (MSc) at The University of Edinburgh.

Supervisor: Siddharth N.

One of the fundamental research goals of artificial intelligence is knowledge representation and compression. At the heart of this are deep generative models (DGMs), models which approximate highly complex and intractable probability distributions. However, training DGMs to organise latent variables hierarchically, where the latent variables represent high-level information about the data, remains an open problem. This project investigates novel approaches to coarse-to-fine perceptual decomposition.

Introduction

This project aims to explore lossy compression techniques using deep generative models (DGMs). DGMs are powerful tools used to model complex, high-dimensional probability distributions, and are able to estimate the likelihood of, represent, and generate data. The goal of this project is to investigate hierarchical compression, a technique that involves compressing data from high to low levels, also known as coarse-to-fine compression.

The motivation behind hierarchical compression is to discard information that is conceptually redundant and not essential for maintaining the perceptual quality of the image, while still retaining its most important features. This research area is relatively unexplored, and it is rare to find DGMs that function with a variable number of hidden components, also known as latent representations.

The project will address this gap in knowledge by studying how DGMs can represent data to compress it hierarchically. This involves understanding how the model can learn to extract and represent the most salient features of an image at multiple levels of abstraction. The results of this research could have significant implications for a variety of fields in artificial intelligence and data analysis.

In addition, the project will explore the interrelated topics of linear factor models and representation learning. Linear factor models, such as principal component analysis (PCA), have been used to represent data efficiently by exploiting correlations between different dimensions. This allows high-dimensional data to be represented in a lower-dimensional space, and is more effective the more correlated the dimensions are. For example, PCA has been used to represent 3D human body shapes by exploiting correlations between different features such as fat/thin and tall/short.

Representation learning involves learning representations of the data that are useful for subsequent processing, such as classification or clustering. DGMs have been shown to be effective at learning representations that capture the underlying structure of the data. By studying how DGMs can be used for hierarchical compression, the project aims to contribute to the field of representation learning and develop techniques that can be applied to a wide range of problems in data analysis and artificial intelligence.

Background & Related Work

Variational Autoencoders

This project uses the Variational Autoencoder (VAE) as the deep generative model. The VAE is a type of deep learning model that assumes independently and identically distributed (i.i.d.) data $\mathbf{x}$ are generated by some random process $\mathbf{x} \sim p_{\theta^*}(\mathbf{x} \mid \mathbf{z})$ involving a continuous random variable $\mathbf{z}$, called a latent variable, which represents high-level components of the data.

VAEs offer an efficient solution to approximate three important quantities:

  1. The maximum likelihood estimate of the parameters $\theta$, allowing artificial data to be generated by sampling $\mathbf{z} \sim p_\theta(\mathbf{z})$ and then $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$.
  2. The posterior inference of the latent variable $\mathbf{z}$ given data $\mathbf{x}$, $p_\theta(\mathbf{z} \mid \mathbf{x})$, which is useful for knowledge representation.
  3. The marginal probability of the variable $\mathbf{x}$, $p_\theta(\mathbf{x})$, which can be used to determine the likelihood of the data.

The VAE jointly learns a deep latent variable model (DLVM) $p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z}) p_\theta(\mathbf{z})$, called the decoder or generative model, and a corresponding inference model that approximates the posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$. VAEs solve a key problem with DLVMs: the marginal $p_\theta(\mathbf{x})$ is intractable due to the integral $p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}, \mathbf{z})\ d\mathbf{z}$. VAEs introduce an approximate inference model $q_\phi(\mathbf{z} \mid \mathbf{x}) \approx p_\theta(\mathbf{z} \mid \mathbf{x})$, called the encoder or recognition model, which enables the marginal $p_\theta(\mathbf{x})$ to be estimated and optimized via $p_\theta(\mathbf{x}) \approx p_\theta(\mathbf{x}, \mathbf{z}) / q_\phi(\mathbf{z} \mid \mathbf{x})$ with $\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})$.

In practice, a VAE consists of an encoder that transforms an input $\mathbf{x}$ into a latent code $\mathbf{z}$, a decoder that maps the latent code back to an output $\mathbf{x'}$, and a loss function that compares $\mathbf{x}$ and $\mathbf{x'}$. During training, VAEs optimize a lower bound on the marginal likelihood of the data, called the Evidence Lower Bound (ELBO), using stochastic gradient descent. The ELBO has two components, the reconstruction term and the KL divergence (respectively):

$$\mathcal{L} (\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \mid\mid p_\theta(\mathbf{z}))$$
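
As a concrete illustration, below is a minimal sketch of how the negative ELBO can be computed in PyTorch, assuming a Gaussian encoder, a standard normal prior, and a Bernoulli likelihood (so the reconstruction term is a binary cross-entropy); the function and tensor names are illustrative, not this project's actual code:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar):
    """Negative ELBO for a VAE with Gaussian encoder q_phi(z|x) = N(mu, sigma^2 I),
    standard normal prior p(z), and Bernoulli likelihood p_theta(x|z)."""
    # Reconstruction term: E_q[log p_theta(x|z)], estimated with a single sample of z.
    # For a Bernoulli likelihood this is the negative binary cross-entropy, summed over pixels.
    log_px_z = -F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL divergence D_KL(q_phi(z|x) || p(z)), available in closed form for two Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # The ELBO is maximised, so its negative is returned as the loss to minimise.
    return -(log_px_z - kl)
```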

In summary, VAEs provide a powerful tool for modeling complex high-dimensional data by learning a lower-dimensional representation that captures the salient features of the data.

Related Work Summary & Contribution

*Towards Conceptual Compression* introduces Convolutional DRAW, an RNN-based VAE that generates images over a flexible number of time steps. However, because of the RNN architecture, the latent representations lose meaning: generation depends on both the latent representations and the previous hidden states. Moreover, generating an image has a complexity of O(t). On the other hand, *Principal Component Analysis Autoencoder* introduces PCA-AE, an autoencoder method inspired by PCA. PCA-AE has meaningful latent representations, but requires multiple encoders to achieve them. Furthermore, PCA-AE does not support a flexible number of time steps: at generation, the decoder must receive the same number of latent variables it was trained with.

This project's aim is to get the best of both worlds, under the assumption that the following features are desirable in a model:

  1. High- and low-level, or coarse-to-fine, features of the data are naturally separated.
  2. The latent representations are meaningful.
  3. Generation is possible with an arbitrary number of latent representations (although not an arbitrary order).
  4. Only a single encoder and decoder are necessary.
  5. Generation takes O(1) time.

Following from this, there are two main contributions of this project. Firstly, a novel Modular VAE architecture has been developed that provides features 2, 3, 4, and 5. Secondly, multiple methods have been developed that achieve features 1 and 2, utilizing the Modular VAE architecture. A summary table comparing Convolutional DRAW, PCA-AE, and the methods developed in this thesis can be found below:

| | Convolutional DRAW | PCA-AE | This work |
| --- | --- | --- | --- |
| High and low-level | ✔️ | ✔️ | ✔️ |
| Meaningful latent representations | | ✔️ | ✔️ |
| Flexible number of latents | ✔️ | | ✔️ |
| Single encoder and decoder | ✔️ | | ✔️ |
| O(1) generation | | ✔️ | ✔️ |

Method

Framework

The methods investigated in this project can be explained at a high level in terms of the diagram below. Here, $\mathbf{x}$ is the data point to be reconstructed, $\mathbf{r}_t$ and $\mathbf{z}_t$ are the reconstruction and latent variable group at time step $t$, respectively, and $T$ is the maximum number of time steps of the model. Each latent variable group contains $\dim(\mathbf{z})$ latent variables, the same for all groups; both $\dim(\mathbf{z})$ and $T$ are design choices. Notably, the total number of latent variables in the model is $\dim(\mathbf{z}) \times T$; for example, with $\dim(\mathbf{z}) = 2$ and $T = 5$, the model has 10 latent variables in total.

Coarse-to-fine perceptual decomposition framework

The high-level coarse-to-fine framework this work considers. Image taken from https://www.robots.ox.ac.uk/~nsid/notes/c2f-vae.html. Note: "l" and "L" in the diagram are equivalent to "t" and "T".

Modular VAE

The equations for the Standard VAE architecture can be found below:

$$\displaylines{\mu, \log(\sigma^2) = \text{Encoder}_\phi(\mathbf{x}) \\ \mathbf{z} = \mu + \sigma \odot \epsilon,\ \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbb{I}) \\ \mathbf{r} = \text{Decoder}_\theta(\mathbf{z})}$$
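
For reference, these equations translate into PyTorch roughly as follows; the layer sizes and architecture below are placeholder assumptions, not this project's actual networks:

```python
import torch
import torch.nn as nn

class StandardVAE(nn.Module):
    """A minimal VAE implementing the equations above; layer sizes are placeholders."""

    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, 2 * z_dim))  # outputs mu and log(sigma^2)
        self.decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        # Reparameterisation trick: z = mu + sigma * epsilon, epsilon ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        r = self.decoder(z)
        return r, mu, logvar
```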

In order to provide the desired qualities and circumvent the issues of the standard VAE, a Modular VAE architecture was developed for this project. Whereas a standard VAE has two components, an encoder and a decoder, the Modular VAE has four components: an encoder, a component called $\text{EncoderToLatents}$, a component called $\text{LatentsToDecoder}$, and a decoder. The equations and a diagram for the Modular VAE architecture are given below:

$$\displaylines{\mathbf{x}_{enc} = \text{Encoder}_\phi(\mathbf{x}) \\ \mu_t, \log(\sigma^2_t) = \text{EncoderToLatents}_{\psi_t}(\mathbf{x}_{enc}) \\ \mathbf{z}_t = \mu_t + \sigma_t \odot \epsilon,\ \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbb{I}) \\ \mathbf{z}_{dec} = \sum_{i=1}^t \text{LatentsToDecoder}_{\omega_i}(\mathbf{z}_i) \\ \mathbf{r} = \text{Decoder}_\theta(\mathbf{z}_{dec})}$$
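
Below is a sketch of the Modular VAE in PyTorch, assuming fully-connected $\text{EncoderToLatents}$ and $\text{LatentsToDecoder}$ components and placeholder layer sizes (the actual components in this project may differ):

```python
import torch
import torch.nn as nn

class ModularVAE(nn.Module):
    """A sketch of the Modular VAE: one shared encoder/decoder plus T pairs of
    EncoderToLatents / LatentsToDecoder components."""

    def __init__(self, x_dim=784, enc_dim=256, z_dim=2, dec_dim=256, T=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, enc_dim), nn.ReLU())
        # One EncoderToLatents per step, each producing mu_t and log(sigma_t^2).
        self.enc_to_lat = nn.ModuleList(
            [nn.Linear(enc_dim, 2 * z_dim) for _ in range(T)])
        # One LatentsToDecoder per step; their outputs are summed into z_dec.
        self.lat_to_dec = nn.ModuleList(
            [nn.Linear(z_dim, dec_dim) for _ in range(T)])
        self.decoder = nn.Sequential(nn.Linear(dec_dim, x_dim), nn.Sigmoid())

    def forward(self, x, t):
        """Reconstruct x using only the first t latent variable groups."""
        x_enc = self.encoder(x)
        z_dec = 0.0
        mus, logvars = [], []
        for i in range(t):
            mu_i, logvar_i = self.enc_to_lat[i](x_enc).chunk(2, dim=-1)
            z_i = mu_i + torch.exp(0.5 * logvar_i) * torch.randn_like(mu_i)
            z_dec = z_dec + self.lat_to_dec[i](z_i)  # sum over the first t groups
            mus.append(mu_i)
            logvars.append(logvar_i)
        r = self.decoder(z_dec)
        return r, mus, logvars
```

Because $\mathbf{z}_{dec}$ is a sum over the contributions of the first $t$ latent variable groups, any $t \leq T$ can be decoded in a single pass through the shared decoder, which is what provides the flexible number of latents and O(1) generation.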

A diagram of the Modular VAE architecture.

Variable Number of Steps

A desirable feature of a coarse-to-fine perceptual decomposition model is the ability to predict the number of steps $t$ for which the model should iterate to reconstruct the target image. Within the Modular VAE framework, an additional $\text{EncoderToLatents}$ component can be introduced to parameterise this. For example, a Gaussian distribution over the number of steps for a target image can be parameterised by $\mu_s$ and $\log(\sigma^2_s)$:

$$\mu_s, \log(\sigma^2_s) = \text{EncoderToLatents}_{\psi_s}(x_{enc})$$

Then, training the component can be achieved by adding a new term to the loss:

$$\displaylines{\mathcal{L}_s (\theta, \phi, \theta_s, \phi_s, \psi_s; \mathbf{x}) = \sum_{i=1}^t P_s(\psi_s; \mu_s, \sigma_s)\, \mathbb{E}_{q_\phi(\mathbf{z}_1, \ldots, \mathbf{z}_i \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z}_1, \ldots, \mathbf{z}_i)] - D_{KL}(q_{\phi_s}(s \mid \mathbf{x}) \mid\mid p_{\theta_s}(s)) \\ P_s(\psi_s; \mu_s, \sigma_s) = \begin{cases} \Phi\left(\frac{1.5 - \mu_s}{\sigma_s}\right) & \text{if } i = 1 \\ \Phi\left(\frac{(i + 0.5) - \mu_s}{\sigma_s}\right) - \Phi\left(\frac{(i - 0.5) - \mu_s}{\sigma_s}\right) & \text{if } 1 < i < t \\ 1 - \Phi\left(\frac{(t - 0.5) - \mu_s}{\sigma_s}\right) & \text{if } i = t \end{cases}}$$

Here, the parameters subscripted with $s$ relate to the number of steps, $\Phi(x)$ is the CDF of the standard Gaussian distribution evaluated at $x$, and $p_{\theta_s}(s)$ is typically $\mathcal{N}(s; 0, 1)$, although this is a design choice. Moreover, a weighting term $\beta_s$ can be applied to the KL divergence to increase or decrease the strength of the regularisation. The new term is then added to the ELBO:

$$\mathcal{L} (\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \mid\mid p_\theta(\mathbf{z})) + \mathcal{L}_s (\theta, \phi, \theta_s, \phi_s, \psi_s; \mathbf{x})$$

In practice, PyTorch's implementation of the CDF is used to allow for gradient descent, and the expectation term is detached from the graph.
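
As an illustration, the discretised step probabilities $P_s$ can be computed with `torch.distributions.Normal` roughly as follows; this is a sketch under the assumptions above, and the function name and variables are illustrative:

```python
import torch
from torch.distributions import Normal

def step_probabilities(mu_s, sigma_s, t):
    """Discretise the Gaussian over the number of steps into probabilities
    P_s(i) for i = 1..t using the standard normal CDF, which is differentiable
    in PyTorch. Assumes t >= 2."""
    std_normal = Normal(0.0, 1.0)
    # CDF evaluated at the half-integer bin boundaries 1.5, 2.5, ..., t - 0.5.
    bounds = torch.arange(1, t) + 0.5
    cdf = std_normal.cdf((bounds - mu_s) / sigma_s)
    first = cdf[:1]               # P(i = 1) = Phi((1.5 - mu_s) / sigma_s)
    interior = cdf[1:] - cdf[:-1] # P(i) = Phi(i + 0.5) - Phi(i - 0.5) for 1 < i < t
    last = 1.0 - cdf[-1:]         # P(i = t) = 1 - Phi((t - 0.5 - mu_s) / sigma_s)
    return torch.cat([first, interior, last])  # sums to 1

# Example usage with illustrative values:
mu_s, sigma_s = torch.tensor(2.0), torch.tensor(1.0)
p = step_probabilities(mu_s, sigma_s, t=5)  # p.sum() == 1
```

These probabilities then weight the per-step reconstruction terms in $\mathcal{L}_s$, which are detached from the graph as noted above, so gradients flow through the CDF rather than through the expectation.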
