Interpreting large datasets using simulation

Developing a sampling method that simulates representative samples given a large dataset

Introduction

It is usually hard to understand and interpret the underlying probabilistic models of a large dataset, particularly when the dataset is high dimensional (such as images). This project investigates the possibility of using a recently developed method called the Generative Adversarial Network (GAN) to simulate samples that help to understand a large dataset and its underlying models. The expected output is a generic sampling scheme that generates specific types of samples that paint a comprehensive picture of a dataset.

Explaining the science

Simulating samples from a distribution is a common statistical task, widely used to draw data points from a known probabilistic model.

However, there are cases where the model is not known and only the dataset itself is available. The Generative Adversarial Network is a recently developed method that can simulate samples from a dataset without an explicit probabilistic model. It works by continually adjusting the simulated data until the difference between the simulated data and the real dataset can no longer be measured. This approach brings a new school of thought into the recent development of sampling methods.
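The "adjust until the difference can no longer be measured" idea can be illustrated with a deliberately simplified sketch. This is not a full GAN (which would train a neural-network discriminator against a neural-network generator); here the measurable "difference" is assumed to be a mismatch in the mean and spread of the samples, and the simulated data are nudged until that mismatch vanishes. The function names and the moment-matching criterion are illustrative assumptions, not part of the project's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" dataset whose generating model is treated as unknown.
real = rng.normal(loc=3.0, scale=0.5, size=1000)

# Simulated samples, initially drawn from the wrong distribution.
sim = rng.normal(loc=0.0, scale=2.0, size=1000)

def discrepancy(a, b):
    """Toy stand-in for a GAN discriminator: mismatch in mean and std."""
    return abs(a.mean() - b.mean()) + abs(a.std() - b.std())

initial = discrepancy(sim, real)

# Iteratively adjust the simulated samples toward the real statistics.
for _ in range(200):
    sim = sim + 0.05 * (real.mean() - sim.mean())  # shift the mean
    sim = sim.mean() + (sim - sim.mean()) * (
        1 + 0.05 * (real.std() - sim.std()) / sim.std()  # rescale the spread
    )

final = discrepancy(sim, real)
print(f"discrepancy: {initial:.3f} -> {final:.3f}")
```

In a real GAN the discrepancy measure is itself learned, which is what lets the method handle high dimensional data such as images, where simple summary statistics are far too weak.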

In this project, the implicitness of the GAN will be utilised, i.e. sampling without a known probabilistic model, to simulate samples from an underlying distribution that may not have a closed form or may be too complicated to learn.

Project aims

Probabilistic models behind high dimensional datasets are not usually easy for humans to understand, particularly when they are represented using deep neural networks. In contrast, many high dimensional samples are themselves very interpretable, such as images. Therefore, the question addressed in this project is: can samples themselves be used to describe a high dimensional dataset?

The project's researchers expect to build a generic sampler that produces samples representative of the underlying probabilistic model. Certainly a few samples cannot be a good representation of the entire model or the dataset; instead, only samples that are representative in a certain sense, defined by the context of an application, are simulated. For example, if a sample of a handwritten '7' is requested, the algorithm would simulate a '7' that lies in close proximity to the vast majority of handwritten '7's in its dataset. The researchers will first build such an algorithm on handwriting datasets, such as MNIST, and benchmark its performance.
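One simple way to make "in close proximity to the vast majority" concrete, sketched here as a hypothetical criterion rather than the project's eventual method, is to pick, among all samples of a class, the one closest to the class centroid in feature space. A GAN-based sampler would simulate such a sample rather than select it from the dataset, but the selection version shows the idea:

```python
import numpy as np

def most_representative(samples):
    """Return the sample closest (in Euclidean distance) to the class mean.

    `samples` is an (n, d) array, e.g. flattened images of handwritten '7's.
    This medoid-style choice is one possible notion of "representative".
    """
    samples = np.asarray(samples, dtype=float)
    centroid = samples.mean(axis=0)
    dists = np.linalg.norm(samples - centroid, axis=1)
    return samples[np.argmin(dists)]

# Toy demo: three points clustered around (1, 1) plus one outlier.
data = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [5.0, 5.0]])
print(most_representative(data))  # the point nearest the centroid
```

Other notions of "representative" (e.g. modes of an estimated density, or samples maximising coverage of the dataset) would lead to different criteria; which one is appropriate depends on the application.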

There has been increasing demand for transparent machine learning in recent years. However, deep neural networks resist meaningful interpretation of their models, which is a major concern in many applications. The proposed project, if successful, could be used to extract meaningful information from trained neural network models, in the form of simulated samples, without re-training the models.

Applications

As machine learning techniques become widely used in many areas of society, the public demands a better understanding of large datasets and of the probabilistic models trained on them. However, many machine learning models, especially deep neural networks, do not admit an interpretable form. This project, if successful, can simulate samples that can be regarded as summaries of a dataset and a model.

The resulting algorithm could be used in scientific applications such as physics, where the generated critical samples may provide valuable information about an existing dataset and its underlying mechanism. The algorithm may also play a role in educational applications, where one may simulate samples as representations of abstract concepts. For example, showing students samples of handwritten '7's is more intuitive than just listing a textual description of a written '7'.

Contact info

[email protected]