Introduction
Applications are now closed.
Remote format
Due to COVID-19, the Data Study Group will run remotely over three weeks and will be divided into two stages.
Stage 1: The Precursor Stage (part-time)
- The precursor stage will last one week in the run-up to the 'event stage' (12 - 16 April).
- The maximum time commitment is 2.5 hours a day.
- Online workshops, presentations and team building to prepare for the 'event stage'.
Stage 2: The Event Stage (full-time)
- The 'event stage' will run over two weeks (19 - 30 April).
- The core working hours will be 9:00 - 17:00 GMT every weekday; however, flexibility will be offered to those participating from different time zones.
- Group work begins and continues throughout.
Applicants should be able to commit to the duration of the event. The Alan Turing Institute is committed to supporting individual circumstances; please do not hesitate to email [email protected] to discuss any reasonable adjustments.
Challenges
Our challenges and data sets are provided by partner organisations for researchers to work on during the event.
The organisations and challenges for this April's Data Study Group are:
- Department for Work and Pensions (DWP) - The assessment of utility and privacy of synthetic data
- CityMaaS - Making travel in cities accessible for people with disabilities through prediction and personalisation
- Entale - Recommendation Systems for Podcast Discovery
- Odin Vision - Exploring AI supported decision-making for early-stage diagnosis of colorectal cancer
- University of Sheffield / Advanced Manufacturing Research Centre (AMRC) – Predictive maintenance of robotic machining tools using acceleration and force data
- Wearable sensors - Activity recognition using wrist-worn accelerometers
Please see below for further details on each challenge.
The skills that we think are particularly relevant to the challenges for this Data Study Group are listed under each challenge description below. Please note, the lists are not exhaustive and we are open to creative interpretation of the challenges listed. Diversity of disciplines is encouraged, and we warmly invite applications from a range of academic backgrounds and specialisms.
DWP - The assessment of utility and privacy of synthetic data
Given the risks of re-identification of sensitive and personal data, and the delays inherent in making such data more available to analysts and data scientists, synthetically generated data is a promising alternative or addition to standard anonymisation procedures. The objective for this challenge is to assess a number of synthesised datasets and determine which existing metrics, or new metrics, are most suitable for measuring data utility, disclosure and privacy. The data utility metrics will be based on general notions as well as actual use cases for DWP. Part of the focus for the DSG will be to determine the best data utility and information disclosure metrics for evaluating a small number of synthetic datasets, provided by the Challenge Owner, against the ground truth. In addition, DSG participants will be able to explore how utility varies as privacy is varied, by comparison with open synthetic data generation algorithms.
Useful skills (one or more of these): Bayesian statistics, Bayesian inference, probabilistic machine learning, deep learning, information theory, uncertainty quantification, time-series modelling, privacy-enhancing technologies, differential privacy.
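As an illustrative sketch of one simple utility metric (not part of the challenge materials), the snippet below compares the marginal distribution of a single categorical column in real versus synthetic data using total variation distance. The toy data and the choice of metric are our own assumptions:

```python
from collections import Counter

def marginal_tvd(real_col, synth_col):
    """Total variation distance between the marginal distributions of one
    column in the real vs. synthetic data (0 = identical, 1 = disjoint)."""
    real_freq = Counter(real_col)
    synth_freq = Counter(synth_col)
    n_real, n_synth = len(real_col), len(synth_col)
    support = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[v] / n_real - synth_freq[v] / n_synth) for v in support
    )

# Hypothetical toy column from the "real" and "synthetic" datasets
real = ["A", "A", "B", "B"]
synth = ["A", "A", "A", "B"]
print(marginal_tvd(real, synth))  # 0.25
```

Marginal comparisons like this are only a starting point; richer metrics (e.g. propensity scores or downstream model performance) are needed to capture multivariate structure.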
CityMaaS - Making travel in cities accessible for people with disabilities through prediction and personalisation
Travelling can be a rewarding activity, but for people with additional needs it is one of the most challenging parts of daily life. Accessible travel is inefficient: on average, people with limited mobility take 2.5 times longer to complete the same journey as able-bodied people, owing to poor infrastructure, such as a lack of step-free access, and a lack of real-time accessibility information. Providing accurate information about the accessibility of locations people might visit, and about which route would best suit their personal needs and preferences, thus poses a fascinating challenge for researchers. Doing so supports people with disabilities not only in their daily lives but also allows for greater independence and freedom.
CityMaaS is a company that aims to address the needs of 1 billion disabled people globally in a way that provides strong social impact through scalable software-as-a-service (SaaS) business models. The CityMaaS Mobility Map uses algorithms that predict accessibility information for any point of interest and personalises routing to optimise the travel experience. The purpose of this project is to address both of these parts of the platform, by improving predictions of POI accessibility and the personalisation of routes.
There are two main research questions of interest: prediction and routing. Participants are encouraged to organise themselves into working groups and explore the research question most interesting to them.
To address the prediction question, researchers will make use of a well-curated data set of 3 million Points of Interest (POI), consisting of the geographic location, accessibility, and type of each point of interest, to develop and test machine learning models for predicting the accessibility of new points. Performance may be improved using public data, such as images, text, or data about the built environment.
Accessibility and comfort of an entire route is user dependent and affected by multiple factors including walking distance, steps, and busyness, all of which affect the urban experience. The challenge is to engineer additional features that are associated with each route that capture user experience and thus allow recommendations to be made for individual end users.
Useful skills: urban planning, geospatial analysis, machine learning (esp. classification), handling large data sets and human computer interaction, using public data sets for augmentation.
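By way of a hedged illustration of the prediction task, the baseline below predicts a POI's accessibility from the majority label among POIs of the same type, falling back to the global majority for unseen types. The data, types and labels are hypothetical:

```python
from collections import Counter, defaultdict

def fit_type_baseline(pois):
    """pois: list of (poi_type, accessible: bool).
    Returns a predictor mapping a POI type to its majority accessibility label."""
    by_type = defaultdict(list)
    for poi_type, accessible in pois:
        by_type[poi_type].append(accessible)
    majority = {
        t: Counter(labels).most_common(1)[0][0] for t, labels in by_type.items()
    }
    overall = Counter(a for _, a in pois).most_common(1)[0][0]

    def predict(poi_type):
        # Unseen types fall back to the global majority label
        return majority.get(poi_type, overall)

    return predict

# Hypothetical toy data: (type, has step-free access?)
predict = fit_type_baseline([
    ("museum", True), ("museum", True), ("museum", False),
    ("pub", False), ("pub", False),
])
print(predict("museum"), predict("pub"))  # True False
```

Any learned model should at least beat this kind of type-frequency baseline before public data augmentation is worth exploring.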
Entale - Recommendation Systems for Podcast Discovery
Entale is a podcast platform that allows you to follow up on what’s mentioned in your podcasts without having to search for it. Read the profile of a trending show host, see the map of an obscure location, watch the trailer of a new movie, and much more. Our next goal is to use what’s mentioned to recommend new podcasts, an area where podcasting is famously lagging behind other creative media.
In this challenge, you will devise a method for capturing relationships between podcasts and what’s mentioned in them and work towards building a podcast recommendation system.
To this end, we provide a representative snapshot of the podcast ecosystem, including transcriptions of podcasts, mentioned entities, and external information associated with these entities. The dataset is a static sample of an ever-evolving ecosystem that ranges from mainstream, high-production shows to niche grassroots shows. In addition, we provide a small and incomplete sample of user listening statistics that can be used to create personalised recommendations.
Participants could begin by exploring potential ways of encoding the podcast metadata into a feature space, either learnt using deep learning or derived from more traditional feature embeddings. Alternatively, one could build an online topic model to recommend new podcasts, or items within them, based on similarity to the topics that have previously interested the listener. One could even construct a graph of the relationships between all of the podcasts and the associated items, using both topic models and the feature embeddings, and let the listener explore it.
Useful skills: Our challenge is a good fit for machine learning generalists with a passion for podcasts and/or the creative industry. We especially welcome participants with experience in one or more of these areas: knowledge-based systems, causal inference, inference from incomplete data, Natural Language Processing, text mining, deep learning, recommender systems.
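To illustrate the feature-embedding route, here is a minimal content-based sketch that represents podcasts as TF-IDF vectors over their transcripts and recommends by cosine similarity. The toy "transcripts" are invented; a real system would also use the provided entity and metadata features:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {name: text}. Returns {name: {term: tf-idf weight}}."""
    tokenised = {name: text.lower().split() for name, text in docs.items()}
    df = Counter()
    for tokens in tokenised.values():
        df.update(set(tokens))  # document frequency of each term
    n = len(docs)
    return {
        name: {t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
        for name, tokens in tokenised.items()
    }

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values())) or 1.0
    return dot / (norm(u) * norm(v))

# Invented toy transcripts standing in for the real dataset
vecs = tfidf_vectors({
    "ep1": "startups venture capital founders",
    "ep2": "venture capital markets founders",
    "ep3": "gardening roses soil",
})
sims = {name: cosine(vecs["ep1"], v) for name, v in vecs.items() if name != "ep1"}
print(max(sims, key=sims.get))  # ep2
```

The same pipeline generalises to entity mentions instead of raw tokens, which is closer to the challenge's framing of "what's mentioned".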
Odin Vision - Exploring AI supported decision-making for early-stage diagnosis of colorectal cancer
In the UK, there are over 42,000 new cases of colorectal cancer (CRC) and 16,000 related deaths per year, making it the second leading cause of cancer deaths, according to Bowel Cancer UK. Odin’s technology supports the detection and characterisation of gastrointestinal diseases in real-time using machine learning. We believe that the best patient care combines the power of artificial intelligence with real-world experience of clinical experts. We aim to avoid black-box predictions and instead facilitate augmented decision-making by delivering interpretable results to clinicians to support diagnosis.
Given multiple datasets of colorectal polyp images, histopathologically determined diagnostic labels and optical feature labels, candidates are invited to explore methods to make diagnostic predictions more interpretable for clinicians. We anticipate that this will be realised through a number of possible avenues:
- Learned representations that tease out/disentangle informative features relevant for diagnosis, e.g. polyp vessel structure, surface patterns or colour
- Multi-instance learning with weak assignment of images to relevant optical features
- Gradient-based tools to interpret deep learning methods, such as guided grad-cam, integrated gradients etc.
- Bayesian methods to quantify predictive uncertainty along with innovative ways to communicate this information to clinicians
Useful skills: Machine learning, interpretability, computer vision, deep learning, representation learning, multi-instance learning, unsupervised learning, Bayesian methods, generative models (VAE, GANs), disentangled representations.
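One of the avenues above, attribution via integrated gradients, can be sketched numerically without any deep learning framework. Here a toy logistic "diagnostic score" over two hypothetical image-derived features stands in for a real model; the weights and inputs are invented:

```python
import math

def integrated_gradients(f, x, baseline, steps=100):
    """Approximate integrated gradients of scalar f at x w.r.t. a baseline,
    using a midpoint Riemann sum and central finite differences."""
    def grad(point, i, eps=1e-5):
        up = point[:]; up[i] += eps
        down = point[:]; down[i] -= eps
        return (f(up) - f(down)) / (2 * eps)

    attributions = []
    for i in range(len(x)):
        total = 0.0
        for k in range(steps):
            alpha = (k + 0.5) / steps  # midpoint of each sub-interval
            point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
            total += grad(point, i)
        attributions.append((x[i] - baseline[i]) * total / steps)
    return attributions

# Toy "diagnostic score": logistic model over two invented features
w = [2.0, -1.0]
f = lambda x: 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1])))
attrs = integrated_gradients(f, x=[1.0, 1.0], baseline=[0.0, 0.0])
# Completeness: attributions should sum to f(x) - f(baseline)
print(sum(attrs), f([1.0, 1.0]) - f([0.0, 0.0]))
```

The completeness property checked in the last lines is a useful sanity test for any attribution implementation applied to a real polyp classifier.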
University of Sheffield / Advanced Manufacturing Research Centre (AMRC) – Predictive maintenance of robotic machining tools using acceleration and force data
The Advanced Manufacturing Research Centre (AMRC) will provide participants with a robotic manufacturing process dataset: a full life-cycle (from normal operation to damage), unlabelled, multivariate time-series of dynamic signals, i.e. the acceleration and force, of one specific manufacturing robotic arm. Because of the inherent flexibility of industrial robotic arms, predictive maintenance of their tools is more challenging than for 'static' machining systems, such as traditional turning and milling machines, owing to their variable and dynamic working conditions. Participants are invited to investigate whether modern data science and AI techniques, e.g. time-series analysis, anomaly detection, domain adaptation, LSTMs, etc., can help extract health-status-related metrics or features that could be used to predict the failure and remaining useful life (RUL) of robotic machining tools, helping to reduce factory downtime and improve manufacturing productivity.
Useful skills: Time-series analysis, outlier detection, neural networks for time series, unsupervised learning, signal processing.
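As a minimal sketch of anomaly detection on such a signal (assuming a single acceleration channel; the trace below is synthetic, not AMRC data), a rolling z-score flags samples that deviate sharply from recent behaviour:

```python
import math
import random
from statistics import mean, stdev

def rolling_zscore_anomalies(signal, window=20, threshold=3.0):
    """Flag indices where a sample deviates more than `threshold` standard
    deviations from the trailing window's mean — a simple health indicator."""
    anomalies = []
    for i in range(window, len(signal)):
        ref = signal[i - window:i]
        sigma = stdev(ref) or 1e-12  # guard against a constant window
        z = abs(signal[i] - mean(ref)) / sigma
        if z > threshold:
            anomalies.append(i)
    return anomalies

# Synthetic "acceleration" trace with an injected fault spike at index 50
random.seed(0)
trace = [math.sin(0.1 * i) + random.gauss(0, 0.05) for i in range(100)]
trace[50] += 2.0
print(rolling_zscore_anomalies(trace))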
Wearable sensors - Activity recognition using wrist-worn accelerometers
Recent years have seen a surge in the adoption of smartwatches with activity trackers (e.g. Fitbit and Apple Watch). This enables analyses of human movement and physical activity levels at the population scale, which are of great value for epidemiological studies, such as understanding obesity epidemics and informing physical activity guidelines. However, technical challenges remain in the processing and analysis of activity tracker data. Current approaches include the development of activity recognition models that translate accelerometer readings into activity labels (e.g. walking, running, sitting). Most of these models are developed using labelled data collected in a lab setting, which has a number of limitations, such as very short measurements, a limited set of pre-specified activities, and the absence of hybrid movements.
In this challenge, we ask participants to develop an activity recognition model using unique accelerometer data collected in a free-living setting. The data was collected from around 150 participants who wore an accelerometer for 24 hours of their normal lives, along with a bodycam during the daytime, making it the largest labelled accelerometer dataset collected in free-living environments. Participants are also encouraged to consider and comment on potential ethical issues with the data and, time permitting, to explore possible solutions such as synthetic data generation methods. Successful methods developed during this challenge could prompt further epidemiological research with the potential to influence global health guidelines.
Useful skills: Time-series analysis, machine learning, classification models, generative models, scikit-learn, PyTorch.
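A common first step for such a model is sliding-window feature extraction. The sketch below (with an invented trace, window size and feature set, not the challenge data) computes simple per-window statistics that a classifier could then be trained on:

```python
import math
from statistics import mean, stdev

def window_features(acc, window=50, step=25):
    """Segment a 1-D acceleration magnitude series into overlapping windows
    and compute simple per-window features (mean, std, range)."""
    features = []
    for start in range(0, len(acc) - window + 1, step):
        seg = acc[start:start + window]
        features.append({
            "mean": mean(seg),
            "std": stdev(seg),
            "range": max(seg) - min(seg),
        })
    return features

# Synthetic trace: low-variance "sitting" followed by high-variance "walking"
trace = [1.0] * 100 + [1.0 + 0.5 * math.sin(i) for i in range(100)]
feats = window_features(trace)
print(feats[0]["std"] < feats[-1]["std"])  # True
```

Windows with distinctive statistics like these are what an activity classifier (e.g. in scikit-learn or PyTorch, as listed above) would consume as inputs.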
How to apply
Applications must be submitted via Flexi-Grant. If you have not done so already, you will need to create a basic Flexi-Grant account. It is quick and free to register. Please be aware you will be required to activate your account via email. If you have any questions regarding the application form or using the online system please email [email protected].
The Alan Turing Institute recognises the under-representation that exists within data science. We are committed to increasing the representation of female, Black and minority ethnic, LGBTQ+, disabled and neurodiverse researchers in data science, and we especially welcome applications from these groups. We believe the best solutions to challenges result when a diverse team works together to share and benefit from the different facets of their experience. You can review our equality, diversity and inclusion (EDI) statement online.
About the event
What are Data Study Groups?
- Intensive five-day 'collaborative hackathons' hosted at the Turing, bringing together organisations from industry, government, and the third sector with talented multi-disciplinary researchers from academia. (Please note this format is currently different due to COVID-19.)
- Organisations act as Data Study Group 'Challenge Owners', providing real-world problems and datasets to be tackled by small groups of highly talented, carefully selected researchers.
- Researchers brainstorm and engineer data science solutions, presenting their work at the end of the week.
Why apply?
The Turing Data Study Groups are popular and productive collaborative events and a fantastic opportunity to rapidly develop and test your data science skills with real-world data. The event also offers participants the chance to forge new networks for future research projects, and build links with The Alan Turing Institute – the UK’s national institute for data science and artificial intelligence.
It’s hard work, a crucible for innovation and a space to develop new ways of thinking.
Reports from previous Data Study Groups are available here.
FAQs
Read our FAQs for Data Study Group applicants.
Find out more
Learn more about being a DSG participant including FAQs
How to write a great Data Study Group application
Queries can be directed to the Data Study Group Team