## Introduction

Most computational problems in data science involve trying to find either an average of some kind or some optimal configuration, but this is typically difficult to do using mathematics alone. However, sampling from a probability distribution can help with this i.e. doing something random in such a way that if you do it often enough the properties of that sample tell you something about what you want to know. This project aims to develop methods that involve sampling from ‘ensembles’ (probability distributions of the state of a system) of large numbers of interacting points or ‘particles’, with a focus on data-centric engineering problems where there are huge numbers of unknowns.

## Explaining the science

The 'sequential Monte Carlo methods' being developed in this project are something like trying to do statistics backwards. In a normal statistical problem you have some sample that you think can tell you about a certain population if you can do the right mathematics. In ‘Monte Carlo’ type methods you do things the other way round: you create a ‘synthetic’ sample and do the same sort of things you’d do in statistics in order to find something out about a quantity that you can’t evaluate analytically. The particular methods developed in this project aim to work in settings in which its hard even to produce this synthetic sample.

A good example is the Mars rover which had to work out where it was and what the environment looked like given what it had seen so far, without frequent input from Earth. This problem can be approached as a probabilistic updating problem. Rather than trying to keep track of multiple probability distributions and doing the related mathematics, you deal with a collection of points which between them give some sort of approximations of the Martian environment. By updating the weightings you attach to these points you can maintain a summary of your certainty, or uncertainty, about where the rover might be at any given time given what it’s seen so far.

One useful approach is to view these points as ‘particles’ that interact with each other. If you look at what happens when you have an enormous number of particles you can often say something about what the limiting value of your estimate will be, and the rate at which it’s approached, and what the fluctuations look like. You can say something about how quickly you approach the right answer as you apply more computing power. This gives you a pretty good idea of how things are going to work when you have a finite computational budget.

Many properties of such systems are now reasonably well characterised, but high dimensions, distributed architectures and enormous datasets bring their own challenges. This project seeks to extend our knowledge of the behaviour of this type of system.

## Project aims

The primary aim of this project is to extend the range of problems which can be effectively addressed by sampling from ‘ensembles’, probability distributions of the state of a system, of large numbers of interacting points or ‘particles’.

These 'Sequential Monte Carlo methods' (see 'Explaining the science') have proved extremely successful in many settings. However, there are some difficulties with the scaling of these methods to problems where there are very large datasets with a large number of unknowns, which often arise in data science in general and data-centric engineering in particular.

An example of this is in medical imaging - a PET image of single subject in a single scan might consist of 250k time series, so analysing each of those separately, a currently common approach, means that 250k problems must be solved for every image.

The project has three main aims:

- To improve understanding of ensemble-based methods in the context of data-centric engineering and related disciplines via theoretical analysis.
- To leverage that understanding to develop methods which scale better to big data problems, distributed, high-performance computing architectures, and large data sets.
- To use the resulting methods to answer question arising in data-centric engineering.

## Applications

Existing application areas for the types of methods being investigated are diverse including, for example, neuroimaging, ray-tracing and autonomous navigation as well as the numerical solution of (and uncertainty quantification for) differential equation systems. Improvements to methodology, and the understanding of that methodology, can be expected to have consequences throughout these and many other areas.

The researchers are working with population biologists to use the techniques they’ve developed in their field to answer questions which arise in the contexts that this project is pursuing.

It is anticipated that real problems arising in a variety of urban process contexts will be addressed using the methods developed and understood.