Scalable topological data analysis

Developing software that enables meaningful conclusions to be drawn from the shape of massive, noisy, and potentially incomplete datasets


Topological data analysis (TDA) describes the shape of noisy and potentially incomplete data in a robust way, so that such data can be better understood and utilised. While there have been several recent breakthroughs, the sequential nature of most existing TDA methods limits their effectiveness. This project aims to tackle this by making the TDA technique ‘persistent homology’ work across parallel processors, in order to produce scalable software.

Explaining the science

Unsupervised machine learning involves training algorithms that are designed to identify groupings of ‘unlabelled’ data without knowing in advance what the groups will be. ‘Persistent homology’, a widely used TDA technique, utilises unsupervised learning to compute topological summaries of data over a range of spatial resolutions.

Features of the shape of the data – such as connected components, ‘tunnels’, and ‘cavities’ – which persist at several resolutions are deemed to represent the true ‘coarse’ geometry of the underlying data, while short-lived features are likely to be artefacts of sampling, noise, or a particular choice of parameters. With this knowledge of the genuine geometric features of a dataset, it becomes possible to draw conclusions about the data, even if it’s incomplete.
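To illustrate the idea (a toy sketch, not part of the project's software), separating persistent features from noise amounts to thresholding on lifetime in a 'persistence diagram', where each feature is recorded as a (birth, death) pair of resolutions. The diagram values and threshold below are hypothetical:

```python
# Toy sketch: filter a persistence diagram by feature lifetime.
# Each pair is (birth, death) for one topological feature; long-lived
# features are kept as genuine geometry, short-lived ones discarded as noise.

def persistent_features(diagram, min_lifetime):
    """Keep features whose lifetime (death - birth) exceeds min_lifetime."""
    return [(b, d) for (b, d) in diagram if d - b > min_lifetime]

# Hypothetical diagram: two long-lived features and two short-lived artefacts.
diagram = [(0.0, 5.0), (0.1, 0.3), (0.2, 4.8), (1.0, 1.1)]
print(persistent_features(diagram, min_lifetime=1.0))
# -> [(0.0, 5.0), (0.2, 4.8)]
```

In practice the threshold is a modelling choice; the point is simply that lifetime gives a principled, parameter-light way to rank features by significance.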

Accurately computing persistent homology requires working with matrices whose size can be exponential in the number of input data points. For TDA to become applicable to gigantic modern datasets, it is therefore essential to improve the practical performance of existing matrix algorithms.
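To make this concrete, here is a minimal sketch of the standard column-reduction algorithm at the heart of persistent homology computation, working over the field with two elements (an illustration of the textbook algorithm, not GUDHI's implementation). Each column of the boundary matrix is stored as a set of row indices, and 'low' denotes a column's largest nonzero row index:

```python
# Standard persistence reduction over Z/2: columns are sets of row indices,
# and symmetric difference of sets implements column addition mod 2.

def low(col):
    """Largest row index with a nonzero entry, or None for an empty column."""
    return max(col) if col else None

def reduce_boundary_matrix(columns):
    """Left-to-right reduction: repeatedly add an earlier column with the
    same 'low' until each nonempty column has a unique low entry."""
    lows = {}  # low row index -> index of the column that claims it
    for j, col in enumerate(columns):
        while col and low(col) in lows:
            col ^= columns[lows[low(col)]]  # add the earlier column (mod 2)
        if col:
            lows[low(col)] = j
    return columns

# Boundary matrix of a filled triangle: vertices 0-2 (empty boundary),
# edges 3-5, and one 2-cell (6) bounded by the three edges.
cols = [set(), set(), set(), {0, 1}, {1, 2}, {0, 2}, {3, 4, 5}]
reduced = reduce_boundary_matrix(cols)
print(reduced[5])  # -> set(): edge 5 closes a loop, later killed by cell 6
```

The sequential bottleneck is visible here: each column's reduction depends on the final state of every earlier column, which is precisely what this project's parallel algorithms aim to relax.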

Project aims

This project is dedicated to developing software which makes the persistent homology pipeline much more parallel in structure, making TDA of truly massive datasets possible.

The starting point is the well-documented GUDHI software package maintained by researchers at INRIA in France. The intention is to augment the GUDHI algorithms with novel ‘extremely parallelisable reduction’ algorithms developed at the Turing by Rodrigo Mendoza-Smith under the supervision of Jared Tanner. These algorithms allow the matrix column operations which form the backbone of persistent homology computation to be safely distributed across multiple processors.
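The general principle can be sketched as follows. This is a hedged illustration of why column operations can be distributed, not the 'extremely parallelisable reduction' algorithm itself: within one pass, every column whose 'low' entry clashes with an earlier pivot column receives an update that is independent of the other updates, so those column additions can safely run on separate processors.

```python
# Hypothetical sketch: batch the independent column additions of each
# reduction pass and apply them concurrently. Columns are sets of row
# indices over Z/2, as in the standard reduction.
from concurrent.futures import ThreadPoolExecutor

def low(col):
    return max(col) if col else None

def parallel_reduce(columns):
    while True:
        # Pivot: the leftmost column claiming each low row index.
        pivots = {}
        for j, col in enumerate(columns):
            l = low(col)
            if l is not None and l not in pivots:
                pivots[l] = j
        # Columns clashing with an earlier pivot: their updates read only
        # pivot columns and write only themselves, so they are independent.
        clashing = [j for j, col in enumerate(columns)
                    if col and pivots[low(col)] != j]
        if not clashing:
            return columns  # every nonempty column has a unique low
        def add_pivot(j):
            columns[j] = columns[j] ^ columns[pivots[low(columns[j])]]
        with ThreadPoolExecutor() as pool:
            list(pool.map(add_pivot, clashing))

# Same filled-triangle boundary matrix as in the sequential sketch.
cols = [set(), set(), set(), {0, 1}, {1, 2}, {0, 2}, {3, 4, 5}]
parallel_reduce(cols)
```

Each pass strictly lowers the 'low' of every clashing column, so the loop terminates, and the final matrix satisfies the same reduced condition as the sequential algorithm; the names and batching strategy here are assumptions for illustration only.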

The Turing’s unique ability to contribute professional software engineers alongside domain experts is invaluable. The development of this software will enable researchers at the Turing and elsewhere to apply topological data analysis at a previously unachievable scale. It will also lead to the development of new techniques that make topological data analysis more robust to the presence of egregious outliers in large datasets.


Over the last two decades, TDA has been successfully applied across the physical, biological, and economic sciences. Perhaps the best-known example of TDA's success is the discovery of a new subtype of breast cancer from an old dataset which had been analysed via other techniques for decades. The improvements this project aims to make to existing TDA algorithms could therefore directly benefit a broad range of research groups working in disparate application domains.

Contact info

[email protected]