Introduction

Topological data analysis (TDA) describes the shape of noisy and potentially incomplete data in a robust way, so that such data can be better understood and utilised. While there have been several recent breakthroughs, the sequential nature of most existing TDA methods limits their effectiveness. This project aims to tackle this limitation by making the TDA technique ‘persistent homology’ run across parallel processors, in order to produce scalable software.

Explaining the science

Unsupervised machine learning involves training algorithms that are designed to identify groupings of ‘unlabelled’ data without knowing in advance what the groups will be. ‘Persistent homology’, a widely used TDA technique, utilises unsupervised learning to compute topological summaries of data over a range of spatial resolutions.

Features of the shape of the data – such as connected components, ‘tunnels’, and ‘cavities’ – which persist at several resolutions are deemed to represent the true ‘coarse’ geometry of the underlying data, while short-lived features are likely to be artefacts of sampling, noise, or a particular choice of parameters. With this knowledge of the genuine geometric features of a dataset, it becomes possible to draw conclusions about the data, even if it’s incomplete.

Accurately computing persistent homology requires working with matrices whose size can grow exponentially with the number of input data points. For TDA to become applicable to gigantic modern datasets, it is therefore essential to improve the practical performance of existing matrix algorithms.
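To make these matrix operations concrete, the sketch below implements the textbook, strictly sequential column reduction at the heart of persistent homology, working over the field with two elements. It is an illustration only: the data structures and helper names are chosen for readability rather than performance, and real implementations are far more sophisticated.

    #include <algorithm>
    #include <iterator>
    #include <map>
    #include <vector>

    // A column of a Z/2 boundary matrix, stored as the sorted row indices
    // of its non-zero entries.
    using Column = std::vector<int>;

    // Row index of the lowest non-zero entry, or -1 for an empty column.
    int low(const Column& col) { return col.empty() ? -1 : col.back(); }

    // Symmetric difference of two sorted index lists: column addition over Z/2.
    Column add_mod2(const Column& a, const Column& b) {
      Column out;
      std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(),
                                    std::back_inserter(out));
      return out;
    }

    // Textbook reduction: process columns left to right, repeatedly adding an
    // earlier column with the same 'low' until each pivot is unique.  The
    // surviving pivots encode the birth/death pairs of the persistence diagram.
    void reduce(std::vector<Column>& matrix) {
      std::map<int, int> pivot_owner;  // low index -> column owning that pivot
      for (int j = 0; j < static_cast<int>(matrix.size()); ++j) {
        while (low(matrix[j]) != -1 && pivot_owner.count(low(matrix[j]))) {
          matrix[j] = add_mod2(matrix[j], matrix[pivot_owner[low(matrix[j])]]);
        }
        if (low(matrix[j]) != -1) pivot_owner[low(matrix[j])] = j;
      }
    }

The strict left-to-right dependency in this loop is what makes the standard algorithm hard to distribute, and it is exactly this bottleneck that the project targets.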

Project aims

This project is dedicated to developing software that gives the persistent homology pipeline a much more parallel structure, making TDA of truly massive datasets possible.

The starting point is the well-documented GUDHI software package maintained by researchers at INRIA in France. The intention is to augment the GUDHI algorithms with novel ‘extremely parallelisable reduction’ algorithms developed at the Turing by Rodrigo Mendoza-Smith under the supervision of Jared Tanner. These algorithms allow the matrix column operations which form the backbone of persistent homology computation to be safely distributed across multiple processors.
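The published algorithm is not reproduced here. The toy sketch below, which reuses the column representation from the previous sketch, only illustrates the underlying principle: once a set of pivot columns is fixed, the remaining columns can be reduced against them independently, so each pass of column operations can be spread over several threads. The pass structure and all names are illustrative assumptions, not the Mendoza-Smith and Tanner algorithm.

    #include <algorithm>
    #include <future>
    #include <iterator>
    #include <map>
    #include <vector>

    // Same Z/2 column representation as in the earlier sketch.
    using Column = std::vector<int>;
    int low(const Column& c) { return c.empty() ? -1 : c.back(); }
    Column add_mod2(const Column& a, const Column& b) {
      Column out;
      std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(),
                                    std::back_inserter(out));
      return out;
    }

    // One pass: every unreduced column is reduced against a fixed, read-only
    // map from pivot row to pivot column, so the work on different columns is
    // independent and can run on separate threads.
    void parallel_pass(std::vector<Column>& unreduced,
                       const std::map<int, Column>& pivots) {
      std::vector<std::future<void>> tasks;
      for (std::size_t j = 0; j < unreduced.size(); ++j) {
        tasks.push_back(std::async(std::launch::async, [&, j] {
          auto it = pivots.find(low(unreduced[j]));
          while (it != pivots.end()) {
            unreduced[j] = add_mod2(unreduced[j], it->second);
            it = pivots.find(low(unreduced[j]));
          }
        }));
      }
      for (auto& t : tasks) t.wait();  // barrier before pivots are refreshed
      // Columns whose new pivots are now unique would be promoted into
      // `pivots` (sequentially) before the next pass begins.
    }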

The Turing’s unique ability to contribute professional software engineers alongside domain experts is invaluable. The development of this software will enable researchers at the Turing and elsewhere to apply topological data analysis at a previously unachievable scale. It will also lead to the development of new techniques that make topological data analysis more robust to the presence of egregious outliers in large datasets.

Applications

Over the last two decades, TDA has been successfully applied across the physical, biological and economic sciences. Perhaps the best-known example of TDA’s success is the discovery of a new type of breast cancer from an old dataset which had been analysed via other techniques for decades. The potential improvements this project aims to make to existing TDA algorithms could therefore directly benefit a broad range of research groups working in disparate application domains.

Contact info

[email protected]

Research Engineering

Members of the Research Engineering Group at the Turing are contributing their expertise to this project.

Two Research Software Engineers were involved in developing a package for computing persistent homology of truly massive datasets. A recent algorithmic breakthrough by Turing researchers Rodrigo Mendoza-Smith and Jared Tanner exploits compressed sensing techniques and the special structure of TDA boundary matrices to massively parallelise the reduction process. The engineers rewrote the original Matlab prototype in C++ and enhanced it with custom data structures, custom memory allocation methods, and the Intel Threading Building Blocks (TBB) library for parallelisation.
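The package’s internal data structures are not described in this summary. Assuming a hypothetical sparse column type and per-column reduction kernel, the fragment below shows only the Intel TBB pattern referred to above, in which a loop over boundary-matrix columns is split across worker threads.

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <algorithm>
    #include <vector>

    // Hypothetical sparse column type; the real package uses custom data
    // structures and memory allocation methods not shown here.
    struct SparseColumn { std::vector<int> rows; };

    // Stand-in for the per-column reduction kernel.
    void reduce_column(SparseColumn& col) {
      std::sort(col.rows.begin(), col.rows.end());
    }

    void reduce_all(std::vector<SparseColumn>& columns) {
      // TBB splits the index range into chunks and schedules the chunks
      // across the available worker threads.
      tbb::parallel_for(tbb::blocked_range<std::size_t>(0, columns.size()),
                        [&](const tbb::blocked_range<std::size_t>& r) {
                          for (std::size_t j = r.begin(); j != r.end(); ++j) {
                            reduce_column(columns[j]);
                          }
                        });
    }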

In order to make this algorithm more accessible and to benefit the wider research community, they have developed an optimised version of the algorithm as an extension to the GUDHI software package.
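The interface of the new extension is not documented here. For orientation, the fragment below sketches the standard GUDHI C++ pipeline that such an extension plugs into: a Rips complex built from a point cloud, stored in a simplex tree and handed to a persistence engine. It is modelled on GUDHI’s own Rips persistence example; exact headers and template parameters may differ between GUDHI versions.

    #include <gudhi/Persistent_cohomology.h>
    #include <gudhi/Rips_complex.h>
    #include <gudhi/Simplex_tree.h>
    #include <gudhi/distance_functions.h>
    #include <vector>

    int main() {
      using Simplex_tree = Gudhi::Simplex_tree<>;
      using Filtration_value = Simplex_tree::Filtration_value;
      using Rips_complex = Gudhi::rips_complex::Rips_complex<Filtration_value>;
      using Field_Zp = Gudhi::persistent_cohomology::Field_Zp;
      using Persistent_cohomology =
          Gudhi::persistent_cohomology::Persistent_cohomology<Simplex_tree, Field_Zp>;

      // A tiny point cloud standing in for a massive dataset.
      std::vector<std::vector<double>> points = {
          {0.0, 0.0}, {1.0, 0.0}, {0.0, 1.0}, {1.0, 1.0}};

      // Build the Rips filtration up to a distance threshold and expand it
      // into a simplex tree containing simplices up to dimension 2.
      Rips_complex rips(points, /*threshold=*/2.0, Gudhi::Euclidean_distance());
      Simplex_tree stree;
      rips.create_complex(stree, /*dim_max=*/2);

      // Compute persistence with Z/2 coefficients and print the diagram.
      Persistent_cohomology pcoh(stree);
      pcoh.init_coefficients(2);
      pcoh.compute_persistent_cohomology(/*min_persistence=*/0.0);
      pcoh.output_diagram();
      return 0;
    }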

A full set of unit and regression tests was developed for each part of the implementation, and the software package was containerised alongside its library dependencies to permit reproducible comparisons.