Dissertation project for Artificial Intelligence (MSc) at The University of Edinburgh.
Supervisor: Siddharth N.
One of the fundamental research goals of artificial intelligence is knowledge representation and compression. At the heart of this are deep generative models (DGMs): models that approximate highly complicated and intractable probability distributions. However, training DGMs to organise their latent variables (which represent high-level information about the data) hierarchically remains an open problem. This project investigates novel approaches to coarse-to-fine perceptual decomposition.
This project aims to explore lossy compression techniques using deep generative models (DGMs). DGMs are powerful tools used to model complex, high-dimensional probability distributions, and are able to estimate the likelihood of, represent, and generate data. The goal of this project is to investigate hierarchical compression, a technique that involves compressing data from high to low levels, also known as coarse-to-fine compression.
The motivation behind hierarchical compression is to discard information that is conceptually redundant and not essential for maintaining the perceptual quality of the image, while still retaining its most important features. This research area is relatively unexplored, and it is rare to find DGMs that function with a variable number of hidden components, also known as latent representations.
The project will address this gap in knowledge by studying how DGMs can represent data to compress it hierarchically. This involves understanding how the model can learn to extract and represent the most salient features of an image at multiple levels of abstraction. The results of this research could have significant implications for a variety of fields in artificial intelligence and data analysis.
In addition, the project will explore the interrelated topics of linear factor models and representation learning. Linear factor models, such as principal component analysis (PCA), have been used to represent data efficiently by exploiting correlations between its dimensions. This allows high-dimensional data to be represented in a lower-dimensional space, and works better the more correlated the dimensions are. For example, PCA has been used to represent 3D human body shapes by exploiting correlations between features such as fat/thin and tall/short.
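The idea above can be illustrated with a small sketch (the data and dimensions here are toy placeholders, not those of the body-shape example): when two dimensions are strongly correlated, a single principal component captures almost all of the variance, so 2-D data can be represented in 1-D with little loss.

```python
import numpy as np

# Toy example: 2-D data whose dimensions are strongly correlated,
# so one principal component captures most of the variance.
rng = np.random.default_rng(0)
t = rng.normal(size=(500, 1))
X = np.hstack([t, 0.9 * t + 0.1 * rng.normal(size=(500, 1))])

# PCA via eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Fraction of total variance explained by the top component.
explained = eigvals[-1] / eigvals.sum()
print(f"variance explained by 1 of 2 components: {explained:.3f}")

# Project onto the top component: each 2-D point becomes a single number.
z = Xc @ eigvecs[:, -1:]
print(z.shape)
```

The more correlated the dimensions, the closer `explained` gets to 1, which is exactly the property linear factor models exploit for compression.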
Representation learning involves learning representations of the data that are useful for subsequent processing, such as classification or clustering. DGMs have been shown to be effective at learning representations that capture the underlying structure of the data. By studying how DGMs can be used for hierarchical compression, the project aims to contribute to the field of representation learning and develop techniques that can be applied to a wide range of problems in data analysis and artificial intelligence.
This project uses the Variational Autoencoder (VAE) as the deep generative model. The VAE is a type of deep learning model that assumes independently and identically distributed (i.i.d.) data.
VAEs offer an efficient solution to approximate three important quantities:
- The maximum likelihood estimate of the parameters $\theta$, allowing artificial data to be generated by sampling $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$.
- The posterior inference of the latent variable $\mathbf{z}$ given data $\mathbf{x}$, $p_\theta(\mathbf{z} \mid \mathbf{x})$, which is useful for knowledge representation.
- The marginal probability of the variable $\mathbf{x}$, $p_\theta(\mathbf{x})$, which can be used to determine the likelihood of the data.
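The reason these quantities need approximating is worth making explicit. Under the standard latent variable formulation, the marginal probability is an integral over the entire latent space, which has no closed form when the decoder is a deep network:

```latex
p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p_\theta(\mathbf{z})\, d\mathbf{z}
```

Since this integral is intractable, the exact posterior $p_\theta(\mathbf{z} \mid \mathbf{x}) = p_\theta(\mathbf{x} \mid \mathbf{z})\, p_\theta(\mathbf{z}) / p_\theta(\mathbf{x})$ is intractable too, motivating the variational approximation the VAE introduces.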
The VAE jointly learns a deep latent variable model (DLVM) $p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z})\, p_\theta(\mathbf{z})$ and an inference model $q_\phi(\mathbf{z} \mid \mathbf{x})$ that approximates the intractable posterior.
In practice, a VAE consists of an encoder that transforms an input $\mathbf{x}$ into the parameters of an approximate posterior over the latent variable $\mathbf{z}$, and a decoder that maps a sample of $\mathbf{z}$ back to a distribution over $\mathbf{x}$.
In summary, VAEs provide a powerful tool for modeling complex high-dimensional data by learning a lower-dimensional representation that captures the salient features of the data.
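The encoder/decoder structure described above can be sketched in a few lines of PyTorch. This is a minimal illustration with Gaussian $q_\phi(\mathbf{z} \mid \mathbf{x})$ and MLP networks; the layer sizes and architecture are placeholders, not those used in this project.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE sketch: the encoder maps x to the mean and log-variance
    of a Gaussian q(z|x); the decoder maps a sampled z back to x-space."""

    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)   # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: sample z while keeping gradients w.r.t.
        # the encoder parameters.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

x = torch.rand(8, 784)           # a batch of 8 flattened 28x28 inputs
recon, mu, logvar = VAE()(x)
print(recon.shape)
```

The reparameterisation step is what makes the sampling differentiable, so the whole model can be trained end-to-end by gradient descent.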
Towards Conceptual Compression introduces Convolutional DRAW, an RNN-based VAE that generates images over a flexible number of time steps. However, with an RNN architecture the latent representations lose meaning, since generation depends on both the latent representations and the previous hidden states. Moreover, generating an image has complexity O(t) in the number of time steps. On the other hand, Principal Component Analysis Autoencoder introduces PCA-AE, an autoencoder method inspired by PCA. PCA-AE has meaningful latent representations, but requires multiple encoders to achieve them. Furthermore, PCA-AE does not support a flexible number of latent variables: at generation, the decoder must receive exactly the number of latent variables it was trained with.
This project's aim is to get the best of both worlds, under the assumption that the following features are desirable in a model:
1. High- and low-level, or coarse-to-fine, features of the data are naturally separated.
2. The latent representations are meaningful.
3. Generation is possible with an arbitrary number of latent representations (although not in an arbitrary order).
4. Only a single encoder and decoder are necessary.
5. Generation takes O(1) time.
Following from this, the project makes two main contributions. First, a novel Modular VAE architecture has been developed that provides features 2, 3, 4, and 5. Second, multiple methods have been developed that achieve features 1 and 2, building on the Modular VAE architecture. A summary comparison of Convolutional DRAW, PCA-AE, and the methods developed in this thesis can be found below:
| | Convolutional DRAW | PCA-AE | This work |
|---|---|---|---|
| High- and low-level features | ✔️ | ✔️ | ✔️ |
| Meaningful latent representations | ❌ | ✔️ | ✔️ |
| Flexible number of latents | ✔️ | ❌ | ✔️ |
| Single encoder and decoder | ✔️ | ❌ | ✔️ |
| O(1) generation | ❌ | ✔️ | ✔️ |
The methods investigated in this project can be explained at a high level in terms of the diagram below.
The high-level coarse-to-fine framework this work considers. Image taken from https://www.robots.ox.ac.uk/~nsid/notes/c2f-vae.html. Note: "l" and "L" in the diagram are equivalent to "t" and "T".
The equations for the Standard VAE architecture can be found below:
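The equations appear to be missing from this section; for reference, the standard VAE objective is the evidence lower bound (ELBO), which in its usual form lower-bounds the log marginal likelihood:

```latex
\log p_\theta(\mathbf{x}) \geq
\mathcal{L}_{\theta,\phi}(\mathbf{x}) =
\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big]
- D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\big)
```

The first term rewards faithful reconstruction of $\mathbf{x}$; the KL term regularises the approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ towards the prior $p_\theta(\mathbf{z})$.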
In order to have the desired qualities and circumvent the issues of the VAE, a modular VAE architecture was developed for the project.
Whereas a standard VAE has two components, an encoder and a decoder, the modular VAE has four components: an encoder, a component called
A diagram of the Modular VAE architecture.
A desirable feature of a coarse-to-fine perceptual decomposition model is the ability to predict the number of steps $T$.
Then, training the component can be achieved by adding a new term to the loss:
Where the parameters subscripted by
In practice, PyTorch's implementation of the CDF is used to allow for gradient descent, and the expectation term is detached from the computation graph.
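A minimal sketch of these two implementation details, assuming a standard Normal distribution and placeholder tensors (the project's actual distribution and loss term are not shown here):

```python
import torch
from torch.distributions import Normal

# 1. torch.distributions exposes a differentiable .cdf(), so a loss built
#    on the CDF still supports gradient descent on its inputs.
value = torch.tensor([0.5], requires_grad=True)
cdf = Normal(loc=0.0, scale=1.0).cdf(value)
cdf.sum().backward()
print(value.grad)  # gradient of the CDF w.r.t. `value` (the Normal pdf)

# 2. Detaching a tensor removes it from the computation graph, so a Monte
#    Carlo estimate of an expectation is treated as a constant target and
#    contributes no gradient of its own.
weights = torch.ones(3, requires_grad=True)
samples = weights * torch.randn(100, 3)
expectation = samples.mean().detach()
print(expectation.requires_grad)
```

Detaching the expectation means gradients flow only through the remaining terms of the loss, which is typically what is wanted when the expectation serves as a fixed target.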