After careful consideration we have decided to postpone the Data Study Group scheduled to take place at the Institute from April 20 – April 24. Given the current concerns about coronavirus, we want to minimise any potential risks to those involved in the event.
All stakeholders, including challenge owners, PIs and participants, have been notified of the postponement.
It is our intention to reschedule the challenges when the public health situation stabilises. We will be in touch with challenge owners, PIs and participants with further updates in due course. 

About the event

What are Data Study Groups?

  • Intensive five-day 'collaborative hackathons' hosted at the Turing, which bring together organisations from industry, government, and the third sector with talented multi-disciplinary researchers from academia.
  • Organisations act as Data Study Group 'Challenge Owners', providing real-world problems and datasets to be tackled by small groups of highly talented, carefully selected researchers.
  • Researchers brainstorm and engineer data science solutions, presenting their work at the end of the week.

Why apply?

The Turing Data Study Groups are popular and productive collaborative events and a fantastic opportunity to rapidly develop and test your data science skills with real-world data. The event also offers participants the chance to forge new networks for future research projects, and build links with The Alan Turing Institute – the UK’s national institute for data science and artificial intelligence.

It’s hard work, a crucible for innovation and a space to develop new ways of thinking.

Reports from a previous Data Study Group are available here.


Read our FAQs for Data Study Group applicants.


Our challenges and datasets are provided by partner organisations for researchers to work on over the week.

The organisations and challenges for this April's Data Study Group are (see the challenge descriptions below):

  • CRUK Cambridge Institute - Modelling interactions driving breast cancer development
  • Greenvest Solutions - Forecasting solar panels' energy supply with satellite data
  • Humanising Autonomy - Identifying safety-critical situations on the road
  • Shield Digital - Detecting illegal online pharmacies and vendors of counterfeit prescription medication
  • UCL - Identifying hospitals which are underperforming when participating in clinical research

The skills that we think are particularly relevant to the challenges for this Data Study Group are listed under each challenge description below.  Please note, the lists are not exhaustive and we are open to creative interpretation of the challenges listed. Diversity of disciplines is encouraged, and we warmly invite applications from a range of academic backgrounds and specialisms.

Challenge descriptions

CRUK Cambridge Institute - Modelling interactions driving breast cancer development

The Cancer Research UK (CRUK) Cambridge Institute is making available a comprehensive dataset of gene expression in 400 Estrogen Receptor (ER) positive breast cancer cell line samples, including control experiments and perturbations in the form of gene knockdowns. In the cell lines considered, ER is the main driver of breast cancer. Participants are invited to use AI and machine learning tools to explore the role of perturbation targets in the development of breast cancer; this work will support the design of new interventions to halt this process. No experience working with biological data is required.

Useful skills: Primarily novel network methods to analyse the data. No one is expected (or likely) to have experience of all of the following, but useful skills may include: network inference, neural networks, Bayesian networks, graph-based models, causal models, correlation/regression-based networks, Boolean implication networks, nested effects models, linear effects models, techniques to validate model robustness (e.g. bootstrapping), and data visualisation.
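As a minimal illustration of the correlation-based network methods mentioned above, the sketch below builds a co-expression network by thresholding pairwise Pearson correlations across samples. All gene names, expression values, and the threshold are invented for illustration; real analysis would use the CRUK dataset and more robust inference.

```python
from itertools import combinations

# Toy expression matrix: gene -> expression values across five samples
# (illustrative numbers only, not real measurements)
expression = {
    "ESR1":  [2.1, 2.5, 1.9, 2.8, 2.4],
    "GATA3": [2.0, 2.6, 1.8, 2.9, 2.3],
    "FOXA1": [0.9, 1.2, 0.8, 1.3, 1.1],
    "TP53":  [0.5, 1.7, 0.8, 0.6, 1.9],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def coexpression_edges(expr, threshold=0.8):
    """Return gene pairs whose |correlation| exceeds the threshold."""
    edges = []
    for a, b in combinations(expr, 2):
        r = pearson(expr[a], expr[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges

edges = coexpression_edges(expression)
```

In this toy data the three ER-pathway genes form a connected cluster while the noisy TP53 profile stays isolated; comparing such networks between control and knockdown conditions is one simple way to probe the effect of a perturbation target.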

This work is supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the “Data Science for Science” theme within that grant and The Alan Turing Institute.


Greenvest Solutions - Forecasting solar panels' energy supply with satellite data 

Greenvest is on a mission to accelerate renewable energy adoption worldwide. The start-up provides strategic technology to plan, monitor, and assess clean energy projects globally; its proprietary solution combines machine learning (ML) and big data analytics on multi-sensor satellite data to model and predict the performance of renewable energy plants. In its founding year, 2018, Greenvest won the prestigious "ActInSpace" hackathon, a competition organised by the European Space Agency and Centre National d'Études Spatiales that involved more than 4,000 participants from over 35 countries. More recently, Greenvest was selected as one of the Top 500 Deeptech Startups Worldwide by "Hello Tomorrow", reached the finals of the Launchpad Program by London Business School, and joined the Energy Access Africa consortium to perform the analytical and quantitative assessments of the social impact of rural electrification projects.

Aims of the challenge

When energy storage (such as batteries) cannot be used due to external circumstances, electricity supply and demand in power grids must match at all times. In practice, this often means embedding backup generators in electricity networks, which are expensive and polluting appliances.

A promising alternative is photovoltaic (PV) panels, i.e. devices that convert solar energy into electricity. However, unlike traditional electricity generation, solar energy sources cannot easily be switched on and off on demand. The scientific community has obtained interesting results in nowcasting the weather using ML, and similar results could be obtained for nowcasting solar radiation.

The aim of this challenge is therefore to forecast solar irradiation on the ground at a specific location. The key difficulties are understanding the losses due to cloud cover, modelling the correlation between atmospheric parameters and cloud data, and providing fast computation (optimally predicting 30 minutes into the future from present data).

This could be achieved by combining ML, big data, and geospatial expertise to predict the amount of solar output. To reach this goal, we will attempt to leverage time series of solar radiation data, atmospheric and cloud parameters, as well as multi-spectral images of clouds, to make an overall prediction.

The benefits of an accurate system could include better use of residential resources, new opportunities for aggregators in capacity and flexibility markets, and a reduction in backup appliances. Overall, this could result in potential savings of £1-10 million per year and about 100,000 tonnes of CO2 per year.

Useful skills: Time series, spatio-temporal pattern prediction, functional data analysis, satellite images, machine learning. 
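A common baseline for this kind of irradiance nowcasting is "smart persistence": assume the clearness index (measured irradiance divided by the theoretical clear-sky irradiance) stays constant over the forecast horizon, and scale the clear-sky value at the target time by it. The sketch below uses invented numbers purely for illustration; a real system would plug in a clear-sky model and measured data.

```python
def smart_persistence(measured, clear_sky, clear_sky_ahead):
    """Forecast irradiance ahead by persisting the recent clearness index.

    measured        - recent measured irradiance readings (W/m^2)
    clear_sky       - clear-sky model values for the same timestamps
    clear_sky_ahead - clear-sky model value at the forecast timestamp
    """
    # Average clearness index over the window (skip night-time zeros)
    indices = [m / c for m, c in zip(measured, clear_sky) if c > 0]
    k = sum(indices) / len(indices) if indices else 0.0
    return k * clear_sky_ahead

# Toy example: a partly cloudy morning (illustrative values only)
measured  = [310.0, 295.0, 320.0]   # last three 10-minute readings
clear_sky = [620.0, 610.0, 640.0]   # clear-sky model for the same times
forecast = smart_persistence(measured, clear_sky, clear_sky_ahead=700.0)
```

Here the recent clearness index is roughly 0.5, so the 30-minute-ahead forecast is about half the clear-sky value; ML models on cloud imagery would then aim to beat this baseline by anticipating how cloud cover changes.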


Humanising Autonomy - Identifying safety-critical situations on the road

Of the 84,968 road incidents reported in 2018, over 67% cite driver error as a primary contributing factor. To tackle this for future autonomous and semi-autonomous systems, Humanising Autonomy aspires to be the global standard for interaction between autonomous vehicles (AVs) and vulnerable road users (VRUs), ensuring that the systems of the future take human behaviour into consideration.

We have taken a human-first approach to designing AI systems, allowing us to model the behaviour of VRUs and better inform AVs in operation. With this, we believe it is possible to improve the safety and efficiency of AV systems in urban environments.

Considering the latest research in Behaviour Science and employing a novel, modular machine learning approach, we are creating models that can seamlessly integrate into existing AV stacks and improve the explainability of decision-making systems.

Aims of the Challenge

At Humanising Autonomy, our ultimate aim is to improve the ability of AVs to respond to VRU behaviour on the road by incorporating the necessary context. In the short term, we are looking to better inform Advanced Driver Assistance Systems (ADAS) so that they also take VRU behaviour into consideration. To do this we are gathering data from multiple sources to identify edge cases and improve our future systems.

One of the greatest challenges in the adoption of autonomous vehicles in the market is that incidents occur in edge cases when something unpredictable happens. Our dataset is uniquely equipped to identify these edge cases, as it has labelled cases of accidents, near-misses, and normal driving. Our partners also share with us large amounts of unlabelled data captured directly from vehicles. Within this data, there is a huge variation in both quality and relevance. Cases with single VRUs or varying crowd sizes are very different and need to be dealt with in different ways. 

With this in mind, the aim of the challenge is to estimate the criticality of, or classify, the footage with regard to what is happening between the VRU and the operating vehicle. We want first to classify collisions and near-misses within the data, with a secondary aim of predicting the likelihood of an incident occurring.

The data provided will be pre-processed by our in-house machine learning models to extract information from commercial and personal vehicle video footage, e.g. dashboard-cameras. The data will include the location and motion of objects in relation to the vehicle, the type of objects detected around the vehicle and the perceived motion of the vehicle. The videos themselves will also be categorised between near-misses, collisions, crossing events, and control scenes where VRUs are not crossing the path of the vehicle.

Useful skills: Computer vision, programming skills, classification/regression algorithms, semi-supervised learning, application of data science to spatio-temporal data; software engineering and neural network experience are extremely useful.
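One simple, widely used criticality measure that could serve as a baseline on the extracted object tracks is time-to-collision (TTC): the distance to a VRU divided by the speed at which the gap is closing. The sketch below is a hypothetical illustration; the threshold value and function names are invented, and a real system would estimate these quantities from the pre-processed motion data described above.

```python
def time_to_collision(distance_m, closing_speed_ms):
    """TTC in seconds; None if the gap to the VRU is not closing."""
    if closing_speed_ms <= 0:
        return None
    return distance_m / closing_speed_ms

def label_frame(distance_m, closing_speed_ms, critical_ttc=1.5):
    """Label a frame 'critical' when TTC falls below a chosen threshold."""
    ttc = time_to_collision(distance_m, closing_speed_ms)
    if ttc is None:
        return "normal"
    return "critical" if ttc < critical_ttc else "normal"

# A pedestrian 6 m ahead with the gap closing at 5 m/s gives TTC = 1.2 s
near_miss = label_frame(6.0, 5.0)    # "critical"
safe_pass = label_frame(30.0, 5.0)   # TTC = 6 s, "normal"
```

A rule like this only captures geometry, not intent; the challenge's labelled near-miss and collision footage is what would let learned models go beyond such hand-crafted thresholds.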


Shield Digital - Detecting illegal online pharmacies and vendors of counterfeit prescription medication

Every year, $200 billion in counterfeit medicines is traded across the globe, including in the UK, where law enforcement seized $2 million worth of counterfeit medications in October 2018. The items discovered were intended for sale online via networks of illegal online pharmacies. Shield Digital aims to assist governments, law enforcement, and pharmaceutical companies in tackling this problem. To this end, we have built a database of illegal online pharmacies (both active and archived) containing observations of different covariates, including images, textual descriptions, templates, medicine names, IPs, Google Analytics IDs, technologies and more.

Notably, once an illegal website is shut down, its operators quickly create new ones, which makes shutting them down manually onerous and mostly ineffectual. Fortunately, many of these websites share the same covariates, such as templates, analytics accounts, or other attributes in their CSS and HTML source code, so it is feasible to group them into clusters based on these similarities. With this in mind, the goal is to develop a scalable detection algorithm that can take hundreds or thousands of web addresses as input and estimate the cluster each falls into (legal or illegal).

Useful skills: Web scraping, data science, machine learning, natural language processing, neural networks, text and network embeddings, representation learning (on networks and texts), graph algorithms. 
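The clustering idea described above can be prototyped as a connected-components pass: any two sites that share a covariate (analytics ID, hosting IP, template hash) are joined into the same cluster via union-find. All domains, IDs and IPs below are invented for illustration.

```python
from collections import defaultdict

# Toy site metadata (entirely fictional domains and identifiers)
sites = {
    "pharma-a.example":  {"ga_id": "UA-111", "ip": "203.0.113.5"},
    "pharma-b.example":  {"ga_id": "UA-111", "ip": "198.51.100.9"},
    "pharma-c.example":  {"ga_id": "UA-222", "ip": "198.51.100.9"},
    "unrelated.example": {"ga_id": "UA-999", "ip": "192.0.2.77"},
}

def cluster_sites(sites):
    """Group sites into components linked by any shared covariate value."""
    parent = {s: s for s in sites}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Index sites by (attribute, value) and union every group that shares one
    by_value = defaultdict(list)
    for site, attrs in sites.items():
        for key, value in attrs.items():
            by_value[(key, value)].append(site)
    for group in by_value.values():
        for other in group[1:]:
            union(group[0], other)

    clusters = defaultdict(set)
    for s in sites:
        clusters[find(s)].add(s)
    return list(clusters.values())

clusters = cluster_sites(sites)
```

Here sites a and b share an analytics ID while b and c share an IP, so all three collapse into one cluster and the unrelated site stays alone; one confirmed illegal pharmacy in a cluster then implicates the whole component, with supervised scoring layered on top in a production system.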


UCL - Identifying hospitals which are underperforming when participating in clinical research 

Clinical trials are currently the gold standard for testing the safety and effectiveness of treatments (drugs or surgical procedures) in clinical care. There is a legal mandate to ensure the safety of participants and the quality of the data generated. A Clinical Trials Unit (CTU) oversees and directs the research, with multiple trial sites (usually hospitals) taking part. While each site is responsible for recruiting patients and collecting data, the CTU must ensure, through trial monitoring, that all applicable ethical and regulatory requirements are adhered to.

Trial monitoring often involves either (1) visiting each site to perform 100% source data verification (SDV), which is very time-consuming and expensive, or (2) Risk-Based Monitoring (RBM), which uses performance indicators for each site to determine the extent, timing, and frequency of monitoring visits. However, choosing performance indicators and validating the models is challenging.

The Question: Can we use AI/ML approaches to identify/predict which sites within an ongoing clinical trial are underperforming or at risk of non-compliance, using centrally held patient-reported data, and previous longitudinal site monitoring data? These predictions will assist in the prioritisation and planning of monitoring actions (e.g. site visits).

Useful skills: Supervised and semi-supervised learning (for classification), predictive models for time series/dynamic prediction, natural language processing (NLP), data fusion/multimodal data/record linkage.
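As a toy illustration of a centrally computed performance indicator of the kind RBM relies on, the sketch below flags sites whose data-query rate deviates strongly from the trial-wide mean using a simple z-score rule. The site names, rates, and threshold are invented; real indicators would be chosen and validated against monitoring outcomes.

```python
from statistics import mean, stdev

# Data queries raised per 100 case report forms, by site (illustrative)
query_rates = {
    "Site A": 4.2, "Site B": 5.1, "Site C": 4.8,
    "Site D": 14.9,  # an outlier that may warrant a monitoring visit
    "Site E": 5.5, "Site F": 4.4,
}

def flag_outlier_sites(rates, z_threshold=1.5):
    """Return sites more than z_threshold standard deviations from the mean."""
    mu = mean(rates.values())
    sigma = stdev(rates.values())
    return [site for site, r in rates.items()
            if abs(r - mu) / sigma > z_threshold]

flagged = flag_outlier_sites(query_rates)
```

In practice such indicators would be tracked longitudinally per site, and supervised models trained on past monitoring findings would replace the fixed threshold.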

This work is supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the “Health” theme within that grant and The Alan Turing Institute.


The Alan Turing Institute will cover travel costs in alignment with our expenses policy. We will also provide accommodation for researchers not normally based in London. Expenses for international applicants are capped at £200, including any visa costs. Lunch and dinner are provided for participants during the week.

The Data Study Group involves a commitment of up to 50 hours across five days, and participants are expected to attend for the full duration of the event where possible. (While the DSG does involve long hours, The Alan Turing Institute is committed to supporting individual circumstances; please email [email protected] to discuss any adjustments you may require.)

The Alan Turing Institute is committed to increasing the representation of female, black and minority ethnic, LGBTQ+, disabled and neurodiverse researchers in data science. We believe the best solutions to challenges result when a diverse team works together to share and benefit from the different facets of their experience. You can review our equality, diversity and inclusion (EDI) statement online.


Find out more

Learn more about being a DSG participant including FAQs

How to write a great Data Study Group application

Queries can be directed to the Data Study Group Team