High performance, large-scale regression

Project goal

Investigating distributed, scalable approaches to the standard statistical task of high-dimensional regression with very large amounts of data, with the ultimate goal of informing current best practice in terms of algorithms, architectures, and implementations.

People

Interns

Alessandra Cabassi and Junyang Wang

Supervisors

Anthony Lee, Programme Director, Data Science at Scale, The Alan Turing Institute, University of Bristol
Rajen Shah, Turing Fellow, University of Cambridge
Yi Yu, University of Bristol
Ioannis Kosmidis, University of Warwick

Project detail

The ultimate goal is to understand critically how different, readily available, large-scale regression algorithms, software, and frameworks perform on distributed systems, and to isolate both computational and statistical performance issues. A specific, challenging dataset will also be included to provide additional focus, and there is the opportunity to investigate more sophisticated, but less readily available, algorithms for comparison.
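
To make concrete the kind of readily available tooling in scope, the sketch below fits a ridge-penalised linear regression with Apache Spark's MLlib, one plausible example of a distributed framework that could be benchmarked. The dataset path and column names are hypothetical placeholders, and this is a minimal illustration of the workflow rather than a prescribed implementation.

```python
# Minimal sketch: ridge-penalised linear regression with Spark MLlib.
# The input path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("large-scale-regression").getOrCreate()

# Assume a large tabular dataset with predictor columns and a response column "y".
df = spark.read.parquet("hdfs:///data/simulated_regression.parquet")

# Assemble the predictor columns into a single feature vector, as MLlib expects.
predictors = [c for c in df.columns if c != "y"]
assembled = VectorAssembler(inputCols=predictors, outputCol="features").transform(df)

# elasticNetParam=0.0 gives a pure L2 (ridge) penalty; regParam sets its strength.
lr = LinearRegression(featuresCol="features", labelCol="y",
                      regParam=0.1, elasticNetParam=0.0)
model = lr.fit(assembled)

print(model.coefficients)                   # fitted coefficient vector
print(model.summary.rootMeanSquaredError)   # in-sample fit diagnostic
```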

This project aligns with the Institute’s strategic priorities in establishing leadership and providing guidance for common data analysis tasks at scale. It can feed into a larger data science at scale software programme around performance and usability, which it is hoped will be developed in 2018.

First phase: benchmark and profile available approaches on the Cray Urika-GX, and potentially other architectures, for a scalable example class of models with carefully chosen characteristics. Different regimes can be explored in which the characteristics of the data and model have substantial effects on performance.
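
As an illustration only, the following sketch shows one way a benchmarking loop over problem regimes could be organised, using a synthetic Gaussian linear model as the example class and single-node NumPy least squares as a stand-in for whichever distributed solver is under test. The grid of problem sizes and the choice of solver are assumptions made for the sketch.

```python
# Minimal single-node sketch of a benchmarking loop over problem regimes.
# In the project, the timed call would be replaced by the distributed
# solver under test (e.g. on the Urika-GX); the (n, p) grid is illustrative.
import time
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p, noise_sd=1.0):
    """Simulate a Gaussian linear model y = X beta + noise."""
    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p)
    y = X @ beta + noise_sd * rng.standard_normal(n)
    return X, y, beta

results = []
for n, p in [(10_000, 10), (10_000, 100), (100_000, 100), (100_000, 1_000)]:
    X, y, beta = simulate(n, p)
    start = time.perf_counter()
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # solver under test
    elapsed = time.perf_counter() - start
    est_error = np.linalg.norm(beta_hat - beta)        # statistical accuracy
    results.append((n, p, elapsed, est_error))
    print(f"n={n:>7} p={p:>5}  time={elapsed:.3f}s  estimation error={est_error:.3f}")
```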

Second phase: use the benchmarks and profiling information to identify which, if any, recently proposed approaches to large-scale regression may improve performance, with the advice of Yi Yu and Rajen Shah.
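
Purely as an illustration of the kind of approach that might be compared at this stage, the sketch below implements sketch-and-solve least squares with a Gaussian random projection, one member of the family of randomised methods for large-scale regression. The method, sketch size, and problem dimensions are assumptions for illustration, not a statement of what the project will adopt.

```python
# Illustrative sketch-and-solve least squares via a Gaussian random projection.
# One example of the randomised approaches to large-scale regression that could
# be compared against a baseline solver; a sketch, not the project's chosen method.
import numpy as np

rng = np.random.default_rng(1)

def sketched_least_squares(X, y, sketch_size):
    """Solve min ||S X b - S y|| for a random Gaussian sketch matrix S."""
    n = X.shape[0]
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
    beta_hat, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta_hat

# Compare against the exact solution on a small synthetic problem.
n, p = 10_000, 50
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

exact, *_ = np.linalg.lstsq(X, y, rcond=None)
approx = sketched_least_squares(X, y, sketch_size=1_000)
print(np.linalg.norm(approx - exact))  # small if the sketch size is adequate
```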

Third phase: apply the skills and software developed to a large and challenging dataset.

Throughout the project, documentation will be written to enable other data scientists to perform large-scale regressions with greater ease, and to understand the implications of using different architectures, frameworks, algorithms, and implementations.

This project is supported by Cray Computing.