Design, analysis, and applications of efficient algorithms for graph based modelling

Project goal

Developing fast and efficient numerical methods for optimisation problems on graphs, making use of continuum (large data) limits in order to develop multi-scale methods, with real-world applications in medical imaging and time series data.

People

Interns

Oliver Crook and Timothy Hurst

Supervisors

Matthew Thorpe, University of Cambridge
Kostas Zygalakis, Turing Fellow, University of Edinburgh
Carola-Bibiane Schönlieb, Turing Fellow, University of Cambridge
Elizabeth Soilleux, University of Cambridge
Mihai Cucuringu, Turing Research Fellow, University of Oxford

Project detail

Many machine learning methods use a graphical representation of data in order to capture the geometry of it, in the absence of a physical model. If we consider the problem of classifying a large data set, say 107 data points, then one common approach is spectral clustering. The idea behind spectral clustering is to project the data onto a small number of discriminating directions where the data should naturally separate into classes. In practice one uses the eigenvectors of the graph Laplacian as directions and then uses off-the-shelf methods such as k-means for the clustering. More importantly, this methodology easily extends to the semi-supervised learning context.

A bottleneck in the above approach is in the computation of eigenvectors of the graph Laplacian. The dimension of the graph Laplacian is equal to the number of data points and therefore becomes infeasible for large data sets. Our approach is to use continuum (large data) limits of the graph Laplacian to approximate the discrete problem with a continuum PDE problem. We can then use standard methods to discretise the continuum PDE problem on a potentially much coarser scale compared to the original discrete problem.

In particular, instead of computing eigenvectors of the graph Laplacian, one would compute eigenfunctions of the continuum limit of the graph Laplacian and use these instead. This should remove the bottleneck in spectral clustering methods for large data. The approach is amenable to multi-scale methods, in particular by computing coarse approximations and iteratively refining using known scaling results.

The project will start by implementing modifications of existing algorithms, in particular we will replace bottlenecks such as computing eigenvalues with an approximation based on continuum limits. Once we have a working algorithm we aim to take the project further by developing classification algorithms for diagnosing coeliac disease from medical images. In particular, using our algorithms, we aim to improve on the current state of the art methods of diagnosing coeliac disease (microscopic examination of biopsies), which is inaccurate with around 20% misclassification.