The Alan Turing Institute’s Research Engineering Group (informally known as ‘Hut23’) are experienced data science researchers who are committed to professional delivery of impactful research, rather than to personal research interests.
We use the challenges that arise in practical projects to inspire new research initiatives, while bringing our customers and collaborators novel techniques and methods that go beyond current practice in data science.
Working with the Turing research community, and with research professionals in partner institutions, the group provides a comprehensive source of in-house research skills for the Institute’s activities.
Visit the Group’s lab pages, which include information on their regular lunchtime tech talks and projects.
Research data scientists
Our Research Data Scientists have expertise in computational statistics, inference, and machine learning, as well as mathematical and computational modelling of complex systems, knowledge representation, and operations research.
They apply their skills to clean, wrangle and analyse data, and to deploy analyses developed by Turing researchers on our high-performance computing platforms.
Research software engineers
Our Research Software Engineers collaborate with our researchers to build and maintain software that implements and supports the research activities.
They work with researchers to define software requirements, develop code, document and explain the software, and support its release and maintenance through open-source channels and publication in research journals.
Artificial intelligence for data analytics (AIDA)
In this project, researchers at the Institute are drawing on new advances in artificial intelligence and machine learning to address data wrangling issues; they aim to develop systems that help to automate each stage of the data analytics process.
Artificial intelligence for data analytics (AIDA) - Datadiff
Datadiff is an AIDA sub-project which aims to automate the process of reconciling inconsistencies between pairs of tabular datasets whose information content is supposed to be similar.
When a dataset is batched into multiple tables, for instance due to periodic data collection, it is not uncommon to find discrepancies in format or structure between the different batches, such as the renaming and/or reordering of columns, changes in units of measurement or encodings, introduction of new columns, etc. Such differences impose an overhead on any consumer of the data wishing to join the separate pieces into a consistent whole.
Typically this process involves human intervention: people are good at resolving issues of this kind by spotting patterns and making educated guesses. Datadiff is an attempt to solve the problem algorithmically.
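To make the reconciliation problem concrete, here is a minimal sketch of one way a column renaming and reordering could be recovered automatically. It is not Datadiff's actual algorithm: the function names, the distance measure (mean gap between sorted values) and the example tables are all assumptions chosen for illustration, and the brute-force search is only practical for a handful of numeric columns.

```python
from itertools import permutations


def column_distance(a, b):
    """Crude distribution distance: mean absolute gap between sorted values."""
    a, b = sorted(a), sorted(b)
    n = min(len(a), len(b))
    return sum(abs(x - y) for x, y in zip(a, b)) / n


def match_columns(reference, batch):
    """Return {batch_name: reference_name} minimising the total column distance.

    Hypothetical helper: tries every assignment of batch columns to
    reference columns, so it assumes the batch has at least as many
    columns as the reference and that all columns are numeric.
    """
    ref_names = list(reference)
    best = min(
        permutations(batch, len(ref_names)),
        key=lambda perm: sum(
            column_distance(reference[r], batch[c])
            for r, c in zip(ref_names, perm)
        ),
    )
    return {c: r for r, c in zip(ref_names, best)}


# Two batches of the same (made-up) survey; the second has its columns
# renamed and reordered.
reference = {"height_cm": [150, 160, 170, 180], "weight_kg": [50, 60, 70, 80]}
batch = {"w": [52, 61, 69, 83], "h": [151, 162, 171, 179]}

mapping = match_columns(reference, batch)
print(mapping)
```

A realistic tool would also need to handle unit changes, re-encodings and newly introduced columns, which this sketch ignores.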
Code Blue is a science gateway that scales to let users run fluid dynamics simulations in the cloud or on their university computing infrastructure. It is a joint venture project with University College London, Imperial College and industry partners.
Working with the children’s charity Coram, this project explores how data collected on children in care can be modelled and visualised to help inform the decisions of local authorities.
Evaluating homomorphic encryption (SHEEP)
SHEEP is a homomorphic encryption evaluation platform. Homomorphic encryption allows mathematical operations to be performed on encrypted data, such that the decrypted result matches what the same operations (addition or multiplication) would have produced on the plaintexts. The goal of the project is to provide an accessible platform for testing the features and performance of the available homomorphic encryption libraries.
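The homomorphic property can be illustrated with unpadded "textbook" RSA, which happens to be homomorphic with respect to multiplication. This is a toy sketch only: the tiny primes and the helper names are assumptions made for illustration, the scheme is insecure as written, and it is not one of the libraries that SHEEP evaluates.

```python
# Toy key material: tiny primes and no padding. Insecure; illustration only.
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # private exponent (modular inverse, Python 3.8+)


def encrypt(m):
    return pow(m, e, n)


def decrypt(c):
    return pow(c, d, n)


a, b = 7, 12

# Multiply the ciphertexts without ever decrypting the inputs.
product_cipher = (encrypt(a) * encrypt(b)) % n

# Decrypting the product of ciphertexts yields the product of plaintexts
# (valid here because a * b is smaller than n).
result = decrypt(product_cipher)
print(result == a * b)
```

Schemes that support only one operation in this way are called partially homomorphic; supporting both addition and multiplication on ciphertexts is considerably more expensive, which is why comparing the performance of libraries matters.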
The Gamma project empowers anyone to examine data, learn to question the legitimacy of its sources, and appreciate the context in which numbers are presented. It is a short project led by Institute researcher Tomas Petricek, and the code behind it is open source.
This project with the National Cyber Security Centre will attempt to identify websites that are provided “unofficially” by government and are not under a well-known top-level domain.
Proof-driven querying (PDQ)
PDQ is a Java library developed by researchers at the University of Oxford for generating database query plans over semantically-interconnected data sources with diverse access interfaces.
A key target application is the problem of making NHS data accessible to data scientists while respecting constraints imposed by privacy, integrity and efficiency. Our project aims to bring this goal closer by refining and extending the library’s query execution functionality.
Safety of offshore floating facilities
This project is a collaboration with the Australian Research Council Industrial Transformation Research Hub for Offshore Floating Facilities. Solitons are large, non-linear waves that can be generated offshore by certain conditions involving tidal forces and the shape of a continental shelf. They can have a significant impact on offshore facilities for the oil and gas industries. The goal of this project is to provide a probabilistic model for soliton formation, producing a distribution of predicted soliton amplitudes, in order to inform decision making.
A robust and well-tested set of software packages will enable these output distributions to be calculated in a timely manner, allow new measurements to be incorporated into the modelling and visualise the modelling results via an intuitive dashboard.
Scalable topological data analysis
The goal of this project is to rewrite existing topological data analysis code, created by a Turing PhD student, so that it firstly runs in parallel on an effectively unlimited number of CPUs with near-linear scaling, and secondly integrates into at least one public software package.
Security in the cloud
This project is a collaboration with Imperial College and aims to establish a cloud computing model for sensitive data, where the cloud provider itself can remain untrusted by the owner of the data. It does this by leveraging Intel's Software Guard Extensions (SGX), which are available in recent generations of Intel CPUs and allow certain computations to be carried out in a trusted "enclave" of an otherwise untrustworthy system.
An outcome of the project will be an extension to the Apache Spark big-data platform to allow this technology to be used for distributed computation, as well as a significant amount of the supporting operating-system-level infrastructure.
Working with us
Turing Fellows and students looking for help from the Group should email the relevant Research Engineering Challenge Lead.
REG Challenge Leads
Work is usually recharged to various Turing research budgets, or to other research grants, and we will work with you to find an appropriate funding source.
The team is also keen to get involved in research grants being developed by the Institute and collaborating universities – we can be included to provide any amount of software engineering or data science effort for your grant.
While the work is usually costed, the Group is occasionally able to support unfunded projects where there is a strong strategic reason – do get in touch.
Soon we hope to be able to share our expertise with external partners. In the interim, please contact [email protected] to register your interest in working with us.