Applications are open. Find out more about the challenges and how to apply below.
Due to COVID-19 the Data Study Group will run remotely over three weeks and will be divided into two stages.
Stage 1: The Precursor Stage (part-time)
- The precursor stage will last one week in the run up to the 'event stage' (6 - 10 September).
- The maximum time commitment is 2.5 hours a day.
- Online workshops, presentations and team building to prepare for the 'event stage'.
Stage 2: The Event Stage (full-time)
- The 'event stage' will run over two weeks (13 - 24 September).
- The core working hours will be 9:00 - 17:00 GMT on weekdays; however, we will be flexible for those participating from different time zones.
- Group work begins and continues throughout.
Applicants should be able to commit to the full duration of the event. The Alan Turing Institute is committed to supporting individual circumstances; please do not hesitate to email [email protected] to discuss any reasonable adjustments.
The challenges are:
- Modelling Amyloid Beta Plaque Formation in Alzheimer's Disease
- Predicting functional relationship between DNA sequence and epigenetic state: Can computational models pay attention to distant genomic variants which experiments show affect chromatin activity?
- Using machine learning to improve sleep habits in Dementia patients
- Automated assessment of vascular perfusion
- Improving the resolution of protein structure imaging
Please see below for further details on each challenge.
The skills that we think are particularly relevant to the challenges for this Data Study Group are listed under each challenge description below. Please note, the lists are not exhaustive and we are open to creative interpretation of the challenges listed. Diversity of disciplines is encouraged, and we warmly invite applications from a range of academic backgrounds and specialisms.
The following challenges and data sets are related to dementia research and have been provided by the UK Dementia Research Institute (UK DRI) and DEMON Network for researchers to work on during the event.
Modelling Amyloid Beta Plaque Formation in Alzheimer's Disease
Amyloid plaques are protein aggregates in the brains of patients with Alzheimer's Disease (AD) and a main hallmark of the disease. Their role in AD is, however, not yet fully understood. The Dementia Research Institute and the Flemish Institute for Bioscience have used a novel omics technique to study the role of amyloid plaques: Spatial Transcriptomics (ST). This technique allows the measurement of all genes in a tissue, retaining spatial information at hundred-micron resolution while capturing histological information in parallel.
Current research recognises a number of different plaque morphologies, but the following questions are yet to be answered: Do different plaque shapes affect the cellular and molecular reaction around these plaques? Do they play different roles in the pathology? Are there more shapes than the ones we currently distinguish? A better understanding of the relationship between protein aggregate presence and plaque morphology will hopefully clarify the role of amyloid plaques, and may offer a new perspective on understanding, preventing and treating AD.
The goal of this project is to predict amyloid plaque presence and morphology from the histological and gene expression landscape. The methodology generated by this project would also advance the interpretation of the rapidly growing body of ST data, which is in high analytical demand. In particular, our aims are to:
1. Determine the presence of amyloid plaques based on gene expression and plaque morphology;
2. Identify potential new sub-types of amyloid plaques using unsupervised clustering methodologies;
3. Characterise the ST landscape around the plaques to extract (biologically) relevant features;
4. Generate machine learning models to learn the relationship between plaque morphology and molecular variation; and
5. Introduce explainable models to gain (biological) understanding of the model's predictions.
Useful skills: Primarily image analysis and machine learning methods.
Image analysis, classification and clustering, CNN/VAE, techniques to validate model robustness (e.g. bootstrapping), data visualisation. No experience of working with biological data is required.
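As a flavour of aim 2 above, unsupervised clustering of per-plaque feature vectors can surface candidate sub-types. This is a minimal sketch only: the five-dimensional features below are synthetic stand-ins for descriptors (e.g. shape or local gene-expression summaries) that a team would extract from the real ST data, and the centres are seeded one per known group purely to keep the demo deterministic.

```python
import numpy as np

def kmeans(X, centres, n_iter=20):
    """Plain k-means from given initial centres; returns a label per row of X."""
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its assigned points.
        centres = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centres))])
    return labels

rng = np.random.default_rng(0)
# Two well-separated synthetic "plaque populations" in feature space.
features = np.vstack([rng.normal(0.0, 0.5, size=(50, 5)),
                      rng.normal(3.0, 0.5, size=(50, 5))])
# Seed one centre in each group so the demo is deterministic.
labels = kmeans(features, centres=features[[0, 50]])
print(np.bincount(labels))  # → [50 50]: sizes of the two candidate sub-types
```

On real data the number of clusters would itself be an open question, which is where model-validation techniques such as bootstrapping come in.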
Predicting functional relationship between DNA sequence and epigenetic state: Can computational models pay attention to distant genomic variants which experiments show affect chromatin activity?
Cellular development of healthy organisms, as well as their predisposition and response to disease, is regulated by molecular programs which ultimately control gene expression. Existing deep learning models are able to predict where DNA is likely to bind regulatory proteins and what the probable epigenetic state is. From these models we can get predictions about which DNA bases are most important for regulatory activity. However, most models are local in their predictions and are only sensitive up to a certain genomic distance. They have also been trained on data from a reference genome assembly: the training/validation data comprise different regions from the same genome, without explicit mapping across different genomes.
In this project we will consider experimental data (specifically histone mark QTLs, linking sequence variability to changes in epigenetic signal) to validate whether state-of-the-art models can pay attention to the right variants, and how this depends on the distance to the affected sites. We will also encourage the creation of custom solutions, particularly to enable the use of existing models on new cell types. Finally, we will aim to obtain predictions for the effects of Alzheimer's-associated genetic variants on cell-type-specific histone modifications.
Essential skills: Machine learning, predictive modelling, feature selection.
Useful skills: Sequence models, deep learning (CNN, RNN, transformers, feature attribution, transfer learning), understanding of gene regulation and omics data, multimodal data integration, causal inference.
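One standard way to ask which bases a sequence model "pays attention" to is in-silico mutagenesis: substitute each position and measure how much the prediction changes. The sketch below illustrates the idea only; `toy_model` is an invented stand-in (a fixed-offset motif detector), not any of the genomics models the challenge refers to.

```python
import numpy as np

BASES = "ACGT"

def toy_model(seq):
    """Stand-in predictor: scores 1.0 if the motif 'TATA' sits at offset 20."""
    return 1.0 if seq[20:24] == "TATA" else 0.0

def mutagenesis_scores(seq, model):
    """For each position, the largest prediction change any substitution causes."""
    ref = model(seq)
    scores = []
    for i in range(len(seq)):
        deltas = [abs(model(seq[:i] + b + seq[i + 1:]) - ref)
                  for b in BASES if b != seq[i]]
        scores.append(max(deltas))
    return np.array(scores)

seq = "G" * 20 + "TATA" + "G" * 26  # 50-bp toy sequence
scores = mutagenesis_scores(seq, toy_model)
print(np.flatnonzero(scores))  # → [20 21 22 23], the positions the model uses
```

The challenge's question is essentially whether, for real models, the nonzero positions recovered this way line up with variants that histone mark QTL experiments show matter, even when those variants are far from the affected site.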
Using machine learning to improve sleep habits in Dementia patients
People living with dementia (PLWD) have been known to suffer from sleep disorders, for example having difficulties falling asleep, waking up irregularly during the night, and waking up too early in the morning.
Our challenge explores whether data-driven learning tools can complement and personalise the generic advice to help PLWD sleep better.
This dataset contains more than 18,000 nights from multiple dementia patients who were monitored for two years with multiple sensors, such as the Withings Sleep Mat. This challenge will explore building a personalised recommendation engine that will allow domain experts to suggest possible interventions to promote better sleep.
In the challenge, we will explore the effect on sleep metrics, such as total time spent in bed, when different conditions are changed, such as room temperature. More ambitiously, can the model suggest a set of gradual changes in environmental and sleep factors that could potentially allow people to regain a healthy sleep pattern? Personalised therapeutics depends on answers to these questions. Further, our setting is general enough to encourage the development of models that generalise beyond the dataset at hand.
Useful skills: Generative modelling (VAE, GAN), time series prediction, machine learning, recommender systems.
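A first, purely descriptive step towards the question above is to summarise how a sleep metric varies with an environmental condition, per patient. The sketch below uses entirely synthetic data with hypothetical column names (`patient_id`, `room_temp_c`, `time_in_bed_h`); the real sensor dataset's schema may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
nights = pd.DataFrame({
    "patient_id": rng.integers(0, 5, size=400),
    "room_temp_c": rng.choice([16, 18, 20, 22], size=400),
})
# Toy generative assumption: cooler rooms -> slightly longer time in bed.
nights["time_in_bed_h"] = (8.0 - 0.1 * (nights["room_temp_c"] - 18)
                           + rng.normal(0, 0.3, size=len(nights)))

# Mean time in bed per patient at each temperature: descriptive only,
# before any causal or recommendation modelling.
summary = nights.groupby(["patient_id", "room_temp_c"])["time_in_bed_h"].mean()
print(summary.head())
```

A recommendation engine would go well beyond this, but even this table already hints at which conditions plausibly co-vary with better sleep for a given individual.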
The following are further health research challenges provided by the University of Birmingham for participants to work on during the event:
Automated assessment of vascular perfusion
The microcirculation is critical to life, as it is where oxygen is transferred from the blood to the tissues. The state of the microcirculation is disrupted in severe illnesses such as trauma or septic shock (serious infection). Current methods of assessing resuscitation are based on evaluating the patient's global, or macro-circulation, using advanced cardiac output monitoring. However, there is evidence from septic patients to suggest that where a patient exhibits improvement in their macro-circulation, their microcirculation may remain impaired, meaning the patient is not being adequately fluid resuscitated.
The microcirculation can be imaged directly using so-called dark-field microscopy to image the microscopic blood vessels under the tongue. These images, which take the form of short video sequences of blood flow in the microvascular system of arterioles and capillaries under the tongue, can be used to assess the perfusion (blood content) of the tissue. This process is currently performed by a human expert, but it needs specialist equipment, is very labour-intensive (45 minutes per video) and has high levels of variability between assessors. In this challenge, the goal is to automate this analysis and to predict a single measure of perfusion from a video, with the aim of dramatically reducing analysis time so that assessments may be made in real time.
Useful skills: Computer vision and machine learning, image processing.
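Before any learned model, a crude classical baseline for "how much is moving in this video" is the mean absolute frame-to-frame intensity difference. The sketch below is an illustration only, run on synthetic arrays rather than dark-field microscopy footage; a real perfusion measure would need vessel segmentation and flow estimation on top of this.

```python
import numpy as np

def motion_score(video):
    """Mean absolute intensity change between consecutive frames.

    video: array of shape (n_frames, height, width), values in [0, 1].
    """
    diffs = np.abs(np.diff(video, axis=0))
    return float(diffs.mean())

rng = np.random.default_rng(0)
static = np.repeat(rng.random((1, 32, 32)), 30, axis=0)  # no flow at all
moving = rng.random((30, 32, 32))                        # every pixel changes
print(motion_score(static) < motion_score(moving))       # → True
```

A data study group would likely replace this proxy with optical flow or a trained regressor, but a simple baseline like this is useful for sanity-checking any more sophisticated pipeline.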
Improving the resolution of protein structure imaging
Understanding the function of proteins in the body often requires a detailed understanding of their structure. Modern tomographic imaging techniques using X-rays and electrons can enable protein structures (or, more precisely, their electron density) to be visualised directly at nanometre-scale resolution in 3D. This is usually done on samples that contain many molecules, and a major challenge is to identify repeated structures in the data which can, when averaged, improve the resolution of the resulting volumetric reconstruction.
This process, known as sub-tomogram averaging, is commonly used when there is a single known (or easily identifiable) repeating structure. The core of this challenge is to develop techniques that enable all of the repeating structures in the data to be identified. This would enable sub-tomogram averaging to be applied across multiple repeating structures which would lead to further resolution improvements. Ideally, this should be done without manual intervention.
Pattern identification is common in image analysis. However, the 3D images here are very large (4000 x 4000 x 4000 voxels), the repeated patterns can be in any orientation, and manual identification of the patterns is exceptionally difficult. Brute-force searching is therefore infeasible. Recent advances in computer vision and machine learning hold promise here, and it is these that we wish to explore.
Useful skills: Computer vision and machine learning with the following specific techniques of particular interest: self-supervised learning; representation learning; group-equivariant convolutional networks; pattern recognition.
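The classical core of locating a known repeating structure in a volume is template matching by cross-correlation. The sketch below plants one copy of a template in a tiny synthetic volume and recovers its position; the challenge is hard precisely because, at tomogram scale, the template is unknown, copies appear in arbitrary orientations, and exhaustive correlation does not scale.

```python
import numpy as np
from scipy.signal import correlate

rng = np.random.default_rng(0)
volume = rng.normal(0, 0.1, size=(32, 32, 32))  # toy stand-in for a tomogram
template = np.ones((4, 4, 4))
volume[10:14, 5:9, 20:24] += template           # plant one copy of the motif

# FFT-based 3D cross-correlation; 'valid' keeps fully-overlapping offsets,
# so the argmax index is the corner of the best-matching sub-volume.
corr = correlate(volume, template, mode="valid", method="fft")
peak = np.unravel_index(corr.argmax(), corr.shape)
print(peak)  # → (10, 5, 20), where the copy was planted
```

Approaches such as self-supervised representation learning or group-equivariant networks, listed above, can be read as ways to make this kind of search tractable without a known template and without enumerating orientations.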
How to apply
Applications must be submitted via Flexi-Grant. If you have not done so already, you will need to create a basic Flexi-Grant account. It is quick and free to register. Please be aware you will be required to activate your account via email. If you have any questions regarding the application form or using the online system please email [email protected].
Reasonable adjustments are changes that organisations must make for you if your disability puts you at a disadvantage compared with others who are not disabled. There will be an option to request reasonable adjustments (for example, closed captioning) in the DSG application form so that you can attend and participate in the Data Study Group. If you select yes and your application is successful, we will contact you via email (or your preferred method) to discuss your individual circumstances. Do not worry if you are unsure whether your query would qualify as a 'reasonable adjustment' - if in doubt, tick anyway and we can discuss your situation with you directly.
The Alan Turing Institute recognises the under-representation that exists within data science. We are committed to increasing the representation of female, black and minority ethnic, LGBTQ+, disabled and neurodiverse researchers in data science, and we especially welcome applications from these groups. We believe the best solutions to challenges result when a diverse team works together to share and benefit from the different facets of their experience. You can review our equality, diversity and inclusion (EDI) statement online.
About the event
What are Data Study Groups?
- Intensive 'collaborative hackathons' hosted at the Turing, which bring together organisations from industry, government, and the third sector, with talented multi-disciplinary researchers from academia. (Please note this format is currently different due to COVID-19).
- Organisations act as Data Study Group 'Challenge Owners', providing real-world problems and datasets to be tackled by small groups of highly talented, carefully selected researchers.
- Researchers brainstorm and engineer data science solutions, presenting their work at the end of the week.
The Turing Data Study Groups are popular and productive collaborative events and a fantastic opportunity to rapidly develop and test your data science skills with real-world data. The event also offers participants the chance to forge new networks for future research projects, and build links with The Alan Turing Institute – the UK’s national institute for data science and artificial intelligence.
It’s hard work, a crucible for innovation and a space to develop new ways of thinking.
Reports from previous Data Study Groups are available here.
Read our FAQs for Data Study Group applicants.
Find out more
Queries can be directed to the Data Study Group Team