Applications are now closed.
Due to COVID-19, our September DSG will run remotely over four weeks and will be divided into two stages.
Stage 1: The Precursor Stage (part-time)
· The precursor stage will last two weeks (31 August – 11 September), in the run-up to the 'event stage'.
· Maximum time commitment two hours a day (not every session will be mandatory).
· Online workshops, presentations and team building to prepare for the 'event stage'.
Stage 2: The Event Stage (full-time)
· The 'event stage' will run over two weeks (14 – 25 September).
· Working hours: 9 am to 5 pm every weekday.
· Group work begins and continues throughout.
September 2020 Challenges
Our challenges and data sets are provided by partner organisations for researchers to work on during the event.
The organisations and challenges leading the Data Study Group this September are:
- CRUK Cambridge Institute - Modelling interactions driving breast cancer development
- Greenvest Solutions - Forecasting wind energy production using satellite data
- catsAi - Communicating high-street bakery sales predictions using counterfactual explanations
- University of Strathclyde and Supergen Energy Networks Hub - Using machine learning to predict the onset of blackouts
Please see below for further details on each challenge.
The skills that we think are particularly relevant to the challenges for this Data Study Group are listed under each challenge description below. Please note, the lists are not exhaustive and we are open to creative interpretation of the challenges listed. Diversity of disciplines is encouraged, and we warmly invite applications from a range of academic backgrounds and specialisms.
CRUK Cambridge Institute - Modelling interactions driving breast cancer development
The Cancer Research UK (CRUK) Cambridge Institute is making available a comprehensive dataset of gene expression in 400 Estrogen Receptor (ER) positive breast cancer cell line samples, including control experiments and perturbations in the form of gene knockdowns. In the cell lines considered, ER is the main driver of breast cancer. Participants are invited to use AI and machine learning tools to explore the role of perturbation targets in the development of breast cancer. This work will allow new interventions to be devised to halt this process. No experience working with biological data is required.
Useful skills: Primarily novel network methods to analyse the data. No-one is expected (or likely) to have experience of all of the following, but skills may include: network inference, neural networks, Bayesian networks, graph-based models, causal models, correlation/regression-based networks, Boolean Implication Networks, Nested Effect Models, Linear Effect Models, techniques to validate model robustness (e.g. bootstrapping), and data visualisation.
This work is supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the “Data Science for Science” theme within that grant and The Alan Turing Institute.
Greenvest Solutions - Forecasting wind energy production using satellite data
Greenvest is on a mission to accelerate renewable energy adoption worldwide. The start-up provides strategic technology to plan, monitor, and assess clean energy projects globally.
When deciding where to build a new wind farm, a company should assess the wind resources of an area with the minimum uncertainty possible. This is usually done in two steps. First, a preliminary assessment is carried out using low-resolution wind maps derived from ground stations. Second, at least one meteorological (met) mast is installed at the most promising locations to record at least one year of current wind data.
This approach presents several problems. On the one hand, even if the wind maps encompass a large temporal frame, they are often inaccurate and poorly interpolated to the desired location. On the other hand, met masts offer precise measurements of wind, but they are expensive to set up and may produce an inaccurate prediction of the long-term production of the wind farm because of the year-by-year variability of wind resources.
A newer approach to this problem is to use mesoscale models and computational fluid dynamics to interpolate satellite-derived data, but this is a computationally expensive method that often underperforms for complex terrain and coarse grids of satellite data. A promising solution is to use machine learning to predict wind resources at a given location from satellite-measured data.
Thus, the aim of this challenge is to forecast wind resources close to the ground for a specific location. The key difficulty is to understand how terrain data and surface roughness could be used to train a model that interpolates the satellite data to the desired location.
This could be achieved by combining ML, big data, and geospatial expertise to predict the amount of wind resource. To reach this goal, we will attempt to leverage time series of wind data, atmospheric and digital surface models, and surface roughness to make an overall prediction.
Useful skills: Time series, functional data analysis, satellite images, machine learning, neural networks.
catsAi - Communicating high-street bakery sales predictions using counterfactual explanations
Many business decisions start with a simple question: “How many will I sell?” Sales are influenced by many factors, including location, product type and weather. catsAi will provide access to a comprehensive dataset of historical sales and meteorological data across thousands of bakery sites. Participants are invited to investigate whether data science and AI can identify factors influencing sales that are poorly defined or as yet undiscovered, and how counterfactual explanations can be applied to promote adoption of, and trust in, these predictions.
Useful skills: programming skills, machine learning, spatio-temporal analysis.
University of Strathclyde and Supergen Energy Networks Hub - Using machine learning to predict the onset of blackouts
Electrical power systems are highly non-linear, dynamical and complex systems, making the investigation of their dynamic behaviour very challenging, especially under increasingly uncertain operation introduced by renewable energy sources on our way to tackling climate change. One of the core mechanisms that can lead to blackouts is the sequential disconnection of power system components, commonly referred to as cascading failures. In this challenge we are interested in investigating the potential of machine learning to predict such events early on at their onset, using a provided dataset of detailed simulated time domain responses.
Useful skills: experience with relevant machine learning methods for time domain data, a computer science or machine learning background, enthusiasm and Python.
About the event
What are Data Study Groups?
- Intensive five-day 'collaborative hackathons' hosted at the Turing, which bring together organisations from industry, government, and the third sector with talented multi-disciplinary researchers from academia. (Please note that this format is currently different due to COVID-19.)
- Organisations act as Data Study Group 'Challenge Owners', providing real-world problems and datasets to be tackled by small groups of highly talented, carefully selected researchers.
- Researchers brainstorm and engineer data science solutions, presenting their work at the end of the week.
The Turing Data Study Groups are popular and productive collaborative events and a fantastic opportunity to rapidly develop and test your data science skills with real-world data. The event also offers participants the chance to forge new networks for future research projects, and build links with The Alan Turing Institute – the UK’s national institute for data science and artificial intelligence.
It’s hard work, a crucible for innovation and a space to develop new ways of thinking.
Read our FAQs for Data Study Group applicants.
Find out more
Queries can be directed to the Data Study Group Team