Fundamentals of statistical machine learning

Developing statistical machine learning tools to keep up with the growing needs of the engineering sciences


The emerging field of data-centric engineering, which aims to apply state-of-the-art statistical, machine learning and AI technologies to enhance modern engineering practice, has created several new challenges for theory and methodology. This research group works in parallel with both practitioners and theoreticians to develop statistical machine learning tools to keep up with the growing needs of the engineering sciences.

Explaining the science

The fusion of the engineering sciences with data science has encouraged researchers to work with ever more complex models. This growing complexity has several advantages: it allows engineers to represent more complex physical phenomena, to provide more accurate and robust predictions, and to calibrate their models with the wide range of datasets now available to them. The result has been significant advances in fields such as large-scale weather modelling and detailed modelling of the heart's chemical signals. Another example is recent progress on digital twins, which are used to reliably monitor large structures in aeronautics, the built environment and the construction industry.

However, these new advances also create significant challenges for statisticians and machine learners. The main issue has been calibrating the models to the data available (called 'inference' or 'learning') so that they are good representations of reality and can be used for prediction. The difficulty in this task is due to the complexity of these models, which has meant that existing statistical methods are not applicable and new algorithms need to be developed.

A first significant challenge is that these new methods need to cope with the increasingly large computational costs that come with large datasets and complex models. On top of this, there is often no guarantee that the model developed by engineers will be a reasonable approximation of reality, and our algorithms need to still return reasonable estimates in those cases. Finally, the uncertainty remaining in our predictions after running our algorithms needs to be properly understood, since these models often affect safety-critical infrastructure.

Project aims

This group focuses on tackling the following three fundamental challenges for statistical machine learning:

1. Fast and robust inference for complex models

A major challenge is the complexity of modern engineering models: it is often not possible to write down a closed-form likelihood function for them, which makes it difficult to calibrate the models with data using standard methods. This requires the development of novel statistical and machine learning tools for inference in these settings.

This research group focuses specifically on developing and studying inference methods for generative models (for which new synthetic data can be simulated and compared to the true data) and other un-normalised likelihood models (for which the shape of the likelihood is known, but it cannot be evaluated in closed form). A significant emphasis of this work is on the robustness of these methods to model misspecification or corrupted data, which is essential to ensure safety in critical application areas.
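To make the idea of simulating synthetic data and comparing it to the true data concrete, here is a minimal sketch of one classical simulation-based approach, rejection approximate Bayesian computation (ABC). It is an illustration only and not necessarily a method used by the group. The simulator (a Normal model with unknown mean), the uniform prior, the sample-mean summary and the tolerance are all assumptions chosen for this example.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Hypothetical generative model: n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def abc_rejection(observed, n_proposals, tolerance, rng):
    """Keep prior draws whose synthetic data look like the observed data.

    Similarity is measured through a single summary statistic (the sample
    mean); no likelihood function is ever evaluated.
    """
    obs_summary = statistics.fmean(observed)
    accepted = []
    for _ in range(n_proposals):
        theta = rng.uniform(-5.0, 5.0)  # assumed uniform prior on [-5, 5]
        synthetic = simulate(theta, len(observed), rng)
        if abs(statistics.fmean(synthetic) - obs_summary) < tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(0)
observed = simulate(2.0, 50, rng)  # stand-in for real data, true mean 2.0
posterior_draws = abc_rejection(observed, 5000, 0.1, rng)
abc_estimate = statistics.fmean(posterior_draws)
print(len(posterior_draws), abc_estimate)
```

The accepted values approximate the posterior for the unknown mean. This plain version offers no protection against misspecification or corrupted data, which is the kind of gap that robust variants aim to close.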

This strand of research complements the Data-Centric Engineering programme project on 'Theoretical foundations of engineering digital twins' led by Dr Andrew Duncan.

2. Computation under computational constraints

The use of statistics and machine learning methodology usually requires approximating complex mathematical quantities using algorithms. Examples include intractable integrals, optimisation problems or differential equations. For large-scale models, a significant challenge is the computational cost of running existing algorithms on large computer clusters. This computational cost can prevent us from obtaining precise estimates of these mathematical quantities, and can hence lead to poor prediction capabilities. It is also strongly undesirable because the large energy needs of advanced computational resources contribute to climate change.

The second goal of this research group is hence to develop algorithms which can make use of the structure of complex engineering models to minimise the computational resources required to obtain precise approximations. A significant focus is put on methods for approximation and numerical integration in the context of large-scale models, as well as the approximation of complex probability distributions which come up in this context.
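As a toy illustration of how exploiting known structure can reduce the computation needed for a given precision, the sketch below compares a plain Monte Carlo estimate of an integral with a control-variate estimate. The integrand exp(x) on [0, 1], the control function x (whose integral, 0.5, is known exactly) and the sample size are assumptions made for this example; they do not come from the project itself.

```python
import math
import random
import statistics


def integrand(x):
    """Function whose integral over [0, 1] we want (true value: e - 1)."""
    return math.exp(x)


def control(x):
    """Correlated function with a known integral over [0, 1], namely 0.5."""
    return x


rng = random.Random(1)
xs = [rng.random() for _ in range(2000)]
f_vals = [integrand(x) for x in xs]
g_vals = [control(x) for x in xs]

# Plain Monte Carlo: average the integrand at uniform random points.
plain_estimate = statistics.fmean(f_vals)

# Control variate: subtract the part of the noise explained by g.
f_mean = plain_estimate
g_mean = statistics.fmean(g_vals)
cov_fg = sum((f - f_mean) * (g - g_mean)
             for f, g in zip(f_vals, g_vals)) / (len(xs) - 1)
beta = cov_fg / statistics.variance(g_vals)  # estimated optimal coefficient
adjusted = [f - beta * (g - 0.5) for f, g in zip(f_vals, g_vals)]
cv_estimate = statistics.fmean(adjusted)

print(plain_estimate, cv_estimate, math.e - 1)
print(statistics.variance(f_vals), statistics.variance(adjusted))
```

Both estimators use the same function evaluations, but the adjusted samples have a much smaller variance, so fewer evaluations are needed to reach a given accuracy.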

This strand of research complements the Data-Centric Engineering projects on 'Sequential sampling methods for difficult problems' and 'Machine learning with polynomials' led by Dr Adam Johansen and Dr Pranay Seshadri respectively.

3. Uncertainty quantification

Due to the challenges associated with inference and computation detailed above, it is common to have a high level of uncertainty associated with any prediction of these complex models. Quantifying the uncertainty in these models and hence in the resulting engineering tools is of prime importance for safety-critical applications where data-centric engineering tools are being used.

The third goal of this research group is hence to advance our understanding of large-scale uncertainty quantification methods. A specific emphasis will be put on Bayesian methods, including Bayesian non-parametric methods (such as Gaussian processes), which are well suited to these problems. The research focuses both on theory for existing methods and on new methodology to meet the needs raised by engineering applications.
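The following sketch shows how a Gaussian process attaches an uncertainty to each prediction. It is a deliberately small, dependency-free illustration. The squared-exponential kernel, unit lengthscale, noise level and sine-wave training data are assumptions for this example, and the hand-written linear solver stands in for the numerical linear algebra library one would normally use.

```python
import math


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel (assumed for this example)."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        residual = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = residual / aug[i][i]
    return x


def gp_predict(x_train, y_train, x_new, noise_var=1e-4):
    """Posterior mean and variance of a zero-mean GP at a single input."""
    K = [[rbf(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    k_star = [rbf(a, x_new) for a in x_train]
    alpha = solve(K, y_train)   # K^{-1} y
    weights = solve(K, k_star)  # K^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, weights))
    return mean, max(var, 0.0)


x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [math.sin(x) for x in x_train]  # toy observations
mean_near, var_near = gp_predict(x_train, y_train, 1.5)  # inside the data
mean_far, var_far = gp_predict(x_train, y_train, 8.0)    # far from the data
print(mean_near, var_near)
print(mean_far, var_far)
```

The predictive variance is small between the training points and returns to the prior variance far away from them, which is the behaviour that makes such models attractive when predictions feed into safety-critical decisions.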

This research complements the Data-Centric Engineering projects on 'Probabilistic numerics', 'Uncertainty quantification of multi-scale and multi-physics computer models' and 'Inverse problems' led by Professor Chris Oates and Professor Serge Guillas.


This research group focuses mainly on the theoretical and methodological challenges that engineering applications raise for statistical machine learning. Nonetheless, the group's research draws on, and contributes to, some of the more applied data-centric engineering projects.