Scalable regression: Tools and techniques

Developing techniques that let regression models scale to large data volumes, and incorporating the power of regression models into big data analytics engines

Project status



The problem being addressed is three-pronged. First, developing regression models (used to estimate the relationships among variables) that can scale to large data volumes without sacrificing accuracy. Second, developing models that can learn which regression models are best suited to the task at hand during analytical-query processing. Third, incorporating these models within analytical-query processing engines, deriving a new paradigm for approximate analytical-query answering based on them.

The expected outputs are prototype software artefacts that produce these models, and a prototype approximate query processing engine that uses the models for efficient, accurate, and scalable analytics processing.

Explaining the science

Regression is a principal means of predictive analytics. As such, it plays a central role in many communities within data science, ranging from statistical machine learning to data management, online analytics, and data warehouses. Despite this, scaling regression models (RMs) and algorithms to big-data volumes remains an elusive goal, with performance and/or accuracy being sacrificed.

Additionally, coupling regression models and tools with big data analytics stacks (BDASs), like Spark and Hadoop, is cumbersome, which discourages many data scientists who lack core expertise in BDASs. As a result, the high potential benefits of RMs over big data collections have not yet materialised. This project aims to fill this gap. Particular emphasis will be placed on employing RMs to analyse big data collections when analysis proceeds in a piecemeal fashion, interrogating one ad hoc defined data subspace at a time. This is particularly challenging because RMs are trained over the whole data space and typically fail to capture local data characteristics.
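The last point can be illustrated with a small sketch. It is not project code: the synthetic data, the subspace bounds, and the helper names `fit_line` and `rmse` are assumptions made purely for illustration. A straight line fitted over the whole data space is compared with one fitted only on the subspace being interrogated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the trend changes at x = 5,
# so no single straight line describes the whole data space well.
x = rng.uniform(0.0, 10.0, size=2000)
y = np.where(x < 5.0, 2.0 * x, 10.0 - 3.0 * (x - 5.0))
y = y + rng.normal(0.0, 0.5, size=x.size)


def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return slope, intercept


def rmse(model, xs, ys):
    """Root-mean-square error of a (slope, intercept) model on (xs, ys)."""
    slope, intercept = model
    return float(np.sqrt(np.mean((slope * xs + intercept - ys) ** 2)))


# A hypothetical ad hoc subspace of interest: 6 <= x <= 8.
mask = (x >= 6.0) & (x <= 8.0)

global_model = fit_line(x, y)             # trained over the whole data space
local_model = fit_line(x[mask], y[mask])  # trained on the subspace only

global_rmse = rmse(global_model, x[mask], y[mask])
local_rmse = rmse(local_model, x[mask], y[mask])

print(f"global model, RMSE on subspace: {global_rmse:.2f}")
print(f"local model,  RMSE on subspace: {local_rmse:.2f}")
```

On data like this the global line averages over two different regimes, so its error inside the subspace is much larger than that of the locally fitted line.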

Project aims

The project is studying whether it's possible to learn which regression models to employ for big data collections, and under which circumstances. Specifically, the project will investigate and provide answers to the following questions:

  • "Given an analytical query, can we select the best model to process it and predict its result?"
  • "Does it matter which model we use?"
  • "Finally, can we develop a model that learns which model to use for different analytical queries?"

Given the above, the next aim of this project is to showcase the power of this knowledge and these models by studying a new paradigm for approximate analytical query processing engines, in which regression models predict accurate query answers instead of accessing the underlying data deluge.
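As a rough sketch of this idea, and not the project's engine, the following assumes a one-dimensional data set and range-average queries described by a hypothetical (center, radius) pair. A linear model is trained on the answers of previously executed queries and then predicts the answer to a new query without scanning the data. All names and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "large" data set (an assumption for illustration).
x = rng.uniform(0.0, 10.0, size=5000)
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)


def exact_avg(center, radius):
    """Exact answer: scan the data and average y where |x - center| <= radius."""
    mask = np.abs(x - center) <= radius
    return float(y[mask].mean())


def features(center, radius):
    """Query representation fed to the regression model (with a bias term)."""
    return np.array([1.0, center, radius])


# Training set: past queries and their exact answers.
train_queries = [
    (rng.uniform(2.0, 8.0), rng.uniform(0.5, 1.5)) for _ in range(200)
]
A = np.array([features(c, r) for c, r in train_queries])
b = np.array([exact_avg(c, r) for c, r in train_queries])

# Ordinary least squares from query parameters to query answers.
weights, *_ = np.linalg.lstsq(A, b, rcond=None)


def predicted_avg(center, radius):
    """Approximate answer: evaluate the model, no data access needed."""
    return float(features(center, radius) @ weights)


new_query = (6.3, 1.0)  # hypothetical unseen query
approx = predicted_avg(*new_query)
exact = exact_avg(*new_query)
print(f"approximate answer: {approx:.3f}, exact answer: {exact:.3f}")
```

The linear query-to-answer model works here only because the synthetic data is itself linear. Real workloads would likely need richer models, which is where the question of which model to use for which query comes in.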

Another key aim is for the project to bring together relevant expertise in programming and data systems, serving as a springboard for future R&D efforts from Turing researchers.

The project will deliver papers and reports on the above central issues, as well as prototype software and experiments with real-world and synthetic data and analytical queries.


Given the central role of predictive analytics, the above results could benefit all areas of data analytics. Key examples include urban analytics (predicting atmospheric particulate concentrations, pollution levels, traffic jams, etc.); crime and justice (e.g. analysing crime rates within parts of cities and at certain times to help policy makers and inform police patrol deployments); and IoT-based applications, including health monitoring, which analyse readings of a metric of interest as it evolves in time and space.


Researchers and collaborators

Contact info

[email protected]