Data Study Group – November 2021

Bringing together top talent from data science, artificial intelligence, and wider fields, to analyse real-world challenges

Monday 29 Nov 2021 - Friday 17 Dec 2021
Time: 09:00 - 17:00

Event type: Data Study Groups

Event series: Data Study Groups


Please note that applications have now closed.

Remote format

Due to COVID-19, the Data Study Group will run remotely over three weeks and will be divided into two stages.

Stage 1: The precursor stage (part-time)

  • The precursor stage will last one week (29 Nov–3 Dec), in the run-up to the 'event stage'.
  • The maximum time commitment is 2.5 hours a day.
  • Online workshops, presentations and team-building sessions will prepare participants for the 'event stage'.

Stage 2: The event stage (full-time)

  • The 'event stage' will run over two weeks (6–17 Dec).
  • The core working hours will be 09:00–17:00 GMT every weekday; however, flexibility will be shown to those participating from different time zones.
  • Group work begins and continues throughout.

Please apply through Flexi-Grant. Participants are expected to attend the full duration of the event. This is a full-time engagement, and taking part only part-time affects other team members and your own learning potential. The Alan Turing Institute is committed to supporting individual circumstances; please do not hesitate to email [email protected] to discuss any reasonable adjustments.


The challenges are:

  • Rapid identification of plankton using machine learning
  • Identifying poor performance at recruitment sites participating in clinical trial research  

Please see below for further details on each challenge.

The skills that we think are particularly relevant to the challenges for this Data Study Group are listed under each challenge description below. Please note, the lists are not exhaustive and we are open to creative interpretation of the challenges listed. Diversity of disciplines is encouraged, and we warmly invite applications from a range of academic backgrounds and specialisms.

Rapid identification of plankton using machine learning

Plankton plays an essential role in the global carbon cycle and carbon sequestration, regulating the exchange of carbon dioxide between the atmosphere, surface ocean and ultimately the seabed. In particular, the role of zooplankton in the food web is critical as they occupy a central position, often controlling the abundance of smaller organisms by grazing and providing food for many larval and adult fish and seabirds. Plankton is also used in global monitoring efforts, providing reliable and sensitive indicators of climate change and ecosystem health.

The RV CEFAS Endeavour, a multi-disciplinary research vessel, collects millions of plankton images during its surveys through the Plankton Imager system: a high-speed imaging instrument which continuously pumps water, takes images of the passing particles, and identifies the zooplankton organisms present. Due to the nature of the passing particles, images have varying shapes and sizes, with a distribution highly skewed towards smaller particles/images. Of these, over 80% can be classified as detritus (e.g., sand, seaweed fragments, microplastics), which is traditionally removed manually (by eye) before any analysis, leaving the remaining plankton images.

The challenge here is to develop rapid on-the-fly machine learning methods that automatically classify plankton species from information on their shape, features and size, trained on a manually labelled dataset of 40,000 images.
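Because detritus dominates the images, class imbalance is likely to be a central concern. As a purely illustrative sketch (not part of the challenge materials), the snippet below computes inverse-frequency class weights, one common way of stopping a classifier from simply favouring the majority class. The class names and counts are hypothetical; only the 80% detritus share reflects the description above.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class, proportional to 1 / class frequency.

    Weights are scaled so that a perfectly balanced dataset would give
    every class a weight of 1.0.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution: 80% detritus, the rest split
# between two made-up plankton classes.
labels = ["detritus"] * 80 + ["copepod"] * 15 + ["fish_larva"] * 5
weights = inverse_frequency_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

Weights of this kind can typically be passed to a loss function or a sampler so that rare classes contribute more per image; whether that suits this dataset is for participants to judge.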

Useful skills: image analysis and computer vision, deep learning, handling imbalanced datasets, ecology, and a keenness to expand into or explore a new field.

Identifying poor performance at recruitment sites participating in clinical trial research  

Clinical trials are currently the gold standard for testing the safety and effectiveness of treatments (drugs or surgical procedures) and diagnostics in clinical care. Ensuring the safety of participants taking part in clinical trials is of utmost importance, both from a legal and ethical standpoint. In addition, data generated as part of clinical trials must be collected in a scientifically robust manner, to ensure the outcome of the trial is reliable and meaningful. 

For large clinical trials, a Clinical Trials Unit (CTU) typically oversees and directs the research, with multiple trial sites (usually hospitals) taking part. Whilst each site is responsible for recruiting the patients and collecting data, the CTU oversees the trial, through central trial monitoring, to ensure that all applicable ethical and regulatory requirements are adhered to. Trial monitoring often involves: (1) Risk-Based Monitoring, which uses performance indicators for each site to determine the extent, timing, and frequency of monitoring visits and (2) visiting both routine and poorly performing sites to conduct a site audit, which includes source data verification. Performance indicators are typically based upon quantitative measures such as percentage of data returned to the CTU and numbers of adverse events reported. With monitoring visits being very time-consuming and expensive, choosing optimal performance indicators and thresholds is a key challenge to ensure the efficient running of a trial.

The question

Using centrally held CTU data, can we use AI and machine learning (ML) approaches to better identify and predict which sites within an ongoing clinical trial are performing poorly or at risk of non-compliance with the trial protocol? The key aim of this challenge is to improve on the current method of selection of which sites to monitor, with the optimal case being that only sites that require corrective action are visited. A proposed method of achieving this is to use ML to investigate the link between pre-visit performance indicator thresholds and monitoring visit outcomes. For example, when considering the performance measure regarding rates of data return, which threshold results in only the sites requiring the greatest corrective action being selected for monitoring visits? In addition, are there better performance indicators than those currently being used that can assist in the prioritisation and planning of monitoring visits?
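To make the threshold question concrete, here is a minimal sketch, assuming each site is summarised by a single data-return rate and a yes/no outcome from its monitoring visit. The function name, the site values and the candidate thresholds are all hypothetical and are not drawn from any CTU data. For each threshold, it flags sites below that rate and reports how many flagged sites actually needed corrective action (precision) and how many sites needing action were flagged (recall).

```python
def evaluate_threshold(sites, threshold):
    """Flag sites whose data-return rate is below `threshold` and score the selection.

    `sites` is a list of (data_return_rate, needed_corrective_action) pairs.
    Returns (precision, recall); each is None when its denominator is zero.
    """
    flagged = [needed for rate, needed in sites if rate < threshold]
    true_positives = sum(flagged)
    total_needing_action = sum(needed for _, needed in sites)
    precision = true_positives / len(flagged) if flagged else None
    recall = true_positives / total_needing_action if total_needing_action else None
    return precision, recall


# Hypothetical sites: (fraction of expected data returned, visit found issues)
sites = [
    (0.55, True),
    (0.62, True),
    (0.70, False),
    (0.81, True),
    (0.90, False),
    (0.97, False),
]

for threshold in (0.60, 0.75, 0.95):
    precision, recall = evaluate_threshold(sites, threshold)
    print(f"threshold={threshold:.2f} precision={precision} recall={recall}")
```

A low threshold visits few sites but misses some that need action; a high threshold catches them all at the cost of unnecessary visits. The optimal case described above corresponds to a precision of 1.0 with as high a recall as possible.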

The data

Time-stamped data will consist of: (1) pre-visit quantitative performance indicators, such as the number of patients consented to take part in the trial and the progression of patients through the treatment protocol, in addition to the raw datasets for the trial; and (2) post-visit ground truth labels indicating how well a site is performing, along with textual data indicating which sites were visited, the issues that were found during the visit, their assigned severity level and any corrective action required.

What we are looking for:

  • Supervised and semi-supervised learning (for classification)
  • Time series / dynamic predictive models 
  • Natural language processing (NLP)

This work is supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/W006022/1, particularly the AI for Science (for CEFAS) and Health (for MRC CTU) theme within that grant & The Alan Turing Institute.  

About the event

What are Data Study Groups?

  • Intensive 'collaborative hackathons' hosted at the Turing, which bring together organisations from industry, government, and the third sector, with talented multi-disciplinary researchers from academia. (Please note that this format is currently different due to COVID-19.)
  • Organisations act as Data Study Group 'Challenge Owners', providing real-world problems and datasets to be tackled by small groups of highly talented, carefully selected researchers.
  • Researchers brainstorm and engineer data science solutions, presenting their work at the end of the week.


How to apply

Applications are being accepted through Flexi-Grant – apply now to avoid disappointment.


Why apply?

The Turing's Data Study Groups are popular and productive collaborative events and a fantastic opportunity to rapidly develop and test your data science skills with real-world data. The event also offers participants the chance to forge new networks for future research projects, and build links with The Alan Turing Institute – the UK’s national institute for data science and artificial intelligence.

It’s hard work, a crucible for innovation and a space to develop new ways of thinking.

Read reports from previous Data Study Groups to see challenges and outcomes.


Read our FAQs for Data Study Group applicants.


Find out more

Learn more about being a DSG participant including FAQs

How to write a great Data Study Group application

Queries can be directed to the Data Study Group Team