Data Study Group – December 2019

Bringing together top talent from data science, artificial intelligence, and wider fields, to analyse real-world challenges

Monday 09 Dec 2019 - Friday 13 Dec 2019
Time: 10:00 - 17:00

Event type

Data Study Groups

Event series

Data Study Groups


Applications are now closed.

Join our Data Study Group mailing list to be notified when the next call for applications opens.

About the event

What are Data Study Groups?

  • Intensive five-day 'collaborative hackathons' hosted at the Turing, which bring together organisations from industry, government, and the third sector with talented multi-disciplinary researchers from academia.
  • Organisations act as Data Study Group 'Challenge Owners', providing real-world problems and datasets to be tackled by small groups of highly talented, carefully selected researchers.
  • Researchers brainstorm and engineer data science solutions, presenting their work at the end of the week.

Why apply?

The Turing Data Study Groups are popular and productive collaborative events and a fantastic opportunity to rapidly develop and test your data science skills with real-world data. The event also offers participants the chance to forge new networks for future research projects, and build links with The Alan Turing Institute – the UK’s national institute for data science and artificial intelligence.

It’s hard work, a crucible for innovation and a space to develop new ways of thinking.

Reports from a previous Data Study Group are available in the Outcomes section of the April 2018 Data Study Group.


Our challenges and datasets are provided by partner organisations for researchers to work on over the week.

The organisations and challenges leading the Data Study Group this December are (see the challenge descriptions below):

  • WWF – Smart monitoring for conservation areas
  • Dstl – Anthrax and nerve agent detector: Identification of hazardous chemical and biological contamination on surfaces using spectral signatures
  • Dstl – Bright-field image segmentation
  • The National Archives – Discovering topics and trends in the UK Government Web Archive
  • Agile Datum – Automating the evaluation of local government planning applications
  • SenSat – Semantic and Instance Segmentation of 3D Point Clouds

The skills that we think are particularly relevant to the challenges for this Data Study Group are listed under each challenge description below.  Please note, the lists are not exhaustive and we are open to creative interpretation of the challenges listed. Diversity of disciplines is encouraged, and we warmly invite applications from a range of academic backgrounds and specialisms.

Challenge descriptions

WWF – Smart monitoring for conservation areas

WWF monitors over 250,000 protected areas, thousands of other sites and critical habitats (e.g. coral reefs, mangroves). These sites are the foundation of our global natural assets and are central to the preservation of biodiversity and human wellbeing.

Unfortunately, they face increasing pressures from human development. A growing challenge for the conservation movement has been to identify, consistently and promptly:

1) proposed or emerging development within key sites;
2) the stakeholders involved.

The timely provision of this actionable information is vital to enable WWF and the conservation community to engage with governments, companies, shareholders, insurers, etc. to help limit the degradation or destruction of key habitats. 

To improve the situation, WWF, with The Alan Turing Institute, now looks to develop the first near-real-time assessment for key sites, starting with a pilot on the flagship protected areas, World Heritage sites (e.g. Serengeti National Park).

The challenge will use the Google News API and different text-data mining techniques to identify current and potential infrastructure pressures across all 244 natural World Heritage Sites. We intend to explore news scraping and different text-data mining techniques to prioritise the most relevant information (e.g. Bayesian filters, sentiment analysis) and to extract new information to refine and improve targeting. This information will be integrated into WWF's global GIS mapping platform.
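As an illustration of the kind of Bayesian filter mentioned above, the following is a minimal sketch of a naive Bayes relevance filter for news headlines. It is not WWF's or the Turing's method: the class name, the labels and the training headlines are all hypothetical, invented purely for the example, and a real pipeline would train on a much larger labelled corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesFilter:
    """Two-class naive Bayes with add-one (Laplace) smoothing."""

    LABELS = ("relevant", "irrelevant")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_odds(self, text):
        """Return log P(relevant | text) - log P(irrelevant | text)."""
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.LABELS:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            log_p = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                log_p += math.log((counts[word] + 1) / (total_words + len(vocab)))
            scores[label] = log_p
        return scores["relevant"] - scores["irrelevant"]


# Hypothetical training headlines, invented for this sketch.
nb = NaiveBayesFilter()
nb.train("New dam planned inside national park", "relevant")
nb.train("Mining licence granted near world heritage reef", "relevant")
nb.train("Road construction threatens protected forest", "relevant")
nb.train("Tourists enjoy safari season in national park", "irrelevant")
nb.train("Photography festival celebrates reef wildlife", "irrelevant")
nb.train("Forest marathon draws record crowd", "irrelevant")

pressure_score = nb.log_odds("Oil pipeline construction planned near park")
tourism_score = nb.log_odds("Safari festival draws tourists")
print(pressure_score > 0, tourism_score < 0)
```

A positive log-odds value suggests the headline resembles the development-pressure examples, so headlines could be ranked by this score before a human reviews them.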

If successful, WWF plans to scale it to a wider group of conservation assets (250,000+) and provide open access to the data generated to support the conservation community.

Useful skills: Natural language processing (NLP), Topic modelling, Text classification, deep learning, Toponym resolution, georeferencing, web data extraction.

This work is supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the “Data Science for Science” theme within that grant and The Alan Turing Institute.


Dstl – Anthrax and nerve agent detector: Identification of hazardous chemical and biological contamination on surfaces using spectral signatures

The assessment of surfaces for potential contamination by biological (e.g. anthrax pathogen Bacillus anthracis) and chemical (e.g. nerve agents such as VX) hazards is relevant for a range of military and civilian applications. To this end, Dstl and the Defence and Security Accelerator (DASA) are providing a dataset collected using a range of different sensor modalities that have measured various surfaces contaminated with surrogate bacteria, hazardous chemicals and relevant control materials. Both un-mixing the contaminant contribution from that of the underlying surface and identifying the contaminant are non-trivial.

Participants are invited to explore how data science and machine learning techniques can be applied to recognise and discriminate between the various contaminants based on data from individual sensors or fusion of multiple data sources, and how models can be applied to characterise contamination on new surfaces without re-training.

Useful skills: Supervised learning, classification, feature extraction, factor analysis, multivariate analysis, spectroscopy, chemometrics, metabolomics, rule based methods, evolutionary algorithms, functional data analysis, machine learning, transfer learning.


Dstl – Bright-field image segmentation

Light/Confocal microscopy is a tool many biomedical researchers use to gather data about their field of interest, including study of disease and infection. Automated analysis of these images typically starts with cellular segmentation (the process of identifying individual cells within images). This is routinely done using high contrast fluorescent labels for either the cytoplasm or plasma membrane. Segmentation using label free modalities such as transmitted light/bright-field microscopy is advantageous because it is less phototoxic than fluorescent imaging and removes the need for labels which may affect the function of the cells.

To the human eye, cells are easy to identify on a transmitted light image. However, due to the similarities in pixel values between the background and cell events, segmentation by computational analysis is still a real challenge. This DSG challenge invites you to utilise the power of AI to design methods to segment cells from confocal microscopy datasets of human/murine immune cells infected with various pathogens.

Useful skills: Image processing/analysis, machine learning, strong programming skills, experience with frameworks such as Keras, PyTorch and TensorFlow desirable, an interest in applying computational methods to biological problems.


The National Archives – Discovering topics and trends in the UK Government Web Archive

The UK Government Web Archive (UKGWA) is a vast resource of governmental websites and social media spanning 23 years; an important source of recent national history. We would like to use machine learning to enable search and discovery of this vast archive, either through browsing by subject area, or finding material similar to a subset of documents. The aim is to build an algorithm capable of identifying like documents and inferring the likely topics that they cover (e.g. “climate change”, “immigration”, “healthy living” or “international relations”). This is a diachronic corpus that is ideal for studying the emergence of those topics and their permeation through the government websites over time.

Looking at the emergence and decay of selected topics across different domains and government websites will indicate engagement priorities and how these change over time. The vast majority of documents (1-2 million in this sub-sample) are likely to have a mix of content; the approach should therefore be capable of asserting the degree of match.
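One simple way to express a "degree of match" between a mixed-content document and several topics is cosine similarity between bag-of-words vectors. The sketch below is only an illustration under stated assumptions: the topic seed phrases and the sample document are hypothetical, and a real approach would more likely use topic models or learned embeddings over the archive itself.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_match(document, topic_seeds):
    """Return a degree of match per topic rather than a single hard label."""
    doc_vec = term_vector(document)
    return {
        topic: cosine_similarity(doc_vec, term_vector(seed))
        for topic, seed in topic_seeds.items()
    }


# Hypothetical seed vocabularies, invented for this sketch.
topic_seeds = {
    "climate change": "climate change emissions carbon warming",
    "healthy living": "healthy living diet exercise nutrition",
}

document = (
    "New guidance on carbon emissions and climate targets, "
    "with a note on exercise"
)
match = topic_match(document, topic_seeds)
for topic, degree in sorted(match.items(), key=lambda kv: -kv[1]):
    print(f"{topic}: {degree:.2f}")
```

Because every topic receives its own score, a document that mixes subjects is not forced into one category, and the scores can be tracked over crawl dates to follow a topic's emergence and decay.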

Useful skills: Natural language processing, language modelling, computational linguistics, machine learning, data science, digital humanities.


Agile Datum – Automating the evaluation of local government planning applications

There are 3,500,000 planning applications each year in the UK, from building a whole estate, to extending a house or fitting new windows on a listed building, to chopping down a tree. In most cases citizens need to apply to the council for permission. However, the forms are complex and the plans you need to submit are detailed. On average it takes three weeks for a council to start looking at a planning application and 30 minutes to do basic checks of the forms and drawings, by which time 40% are rejected due to 10 common errors.

This challenge is to automatically analyse the forms and drawings to identify these 10 common errors and to inform the citizen in seconds rather than after a three-week delay, saving the council 30 minutes per application (250,000 man days a year in the UK).
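The 250,000-day figure can be checked with a short calculation. The sketch below reproduces it from the numbers quoted above, assuming a seven-hour working day; that day length is an assumption chosen because it matches the stated total, and is not given in the challenge description.

```python
APPLICATIONS_PER_YEAR = 3_500_000  # from the challenge description
CHECK_MINUTES = 30                 # basic checks per application
HOURS_PER_WORKING_DAY = 7          # assumption, not stated in the source


def working_days_saved(applications, minutes_each, hours_per_day):
    """Convert per-application checking time into working days per year."""
    total_hours = applications * minutes_each / 60
    return total_hours / hours_per_day


days = working_days_saved(
    APPLICATIONS_PER_YEAR, CHECK_MINUTES, HOURS_PER_WORKING_DAY
)
print(f"{days:,.0f} working days per year")
```

That is, 3,500,000 applications at 30 minutes each come to 1,750,000 hours, which is 250,000 days of seven hours.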

Useful skills: Supervised learning, classification, computer vision, optical character recognition, natural language processing, hierarchical modelling, unstructured data.


SenSat – Semantic and Instance Segmentation of 3D Point Clouds

Autonomous vehicles require digital maps to avoid possible collision and navigate safely; smart cities require knowledge of urban features to be managed appropriately; and digital twins require physical assets to be recognised before they can simulate predictive models. All those applications require a detailed representation and understanding of the spatial environment. SenSat captures high resolution images via drones with a ground sampling distance of ~2.5cm. Those images are then transformed into 3D point clouds, using techniques such as Structure from Motion (SfM).

Point clouds are unstructured and unordered data representing the real world with XYZ and RGB values. Although visually rich, these point clouds have limited spatial context associated for algorithms to extract meaningful information. In order to extract the spatial context, techniques such as point cloud segmentation and classification are commonly explored. This allows computers to recognise the composition of the 3D scene. However, the lack of effective semantic and instance segmentation techniques is currently acting as a blocker in this industry.

This DSG invites participants to explore point cloud segmentation techniques, both semantic and instance, in order to recognise objects such as roads, buildings, cars, trees, and ground in a large 3D urban environment. This will enable safer autonomous vehicles on the road, automated asset management in urban planning, and accurate digital twin simulations.

Useful skills: Computer vision, 3D scene understanding, machine learning, CNNs, Keras, PyTorch, TensorFlow, Tree data structures, data science and programming skills with a hunger to push current techniques to their limit.



The Alan Turing Institute will cover travel costs in alignment with our expenses policy. We will also provide accommodation for researchers not normally London-based. Accommodation may also be available for researchers from a London university or research institute who travel from outside London to work. Expenses for international applicants are capped at £200, which includes any visa costs. Lunch and dinner are provided for participants during the week.

The Data Study Group involves a commitment of up to 50 hours across five days, and participants are expected to attend the full duration of the event where possible. (While the DSG does involve long hours, The Alan Turing Institute is committed to supporting individual circumstances; please email [email protected] to discuss any adjustments you may require.)

The Alan Turing Institute is committed to increasing the representation of female, black and minority ethnic, LGBTQ+, disabled and neurodiverse researchers in data science. We believe the best solutions to challenges result when a diverse team work together to share and benefit from the different facets of their experience. You can review our equality, diversity and inclusion (EDI) statement online.


Find out more

How to get involved as a researcher

How to write a great Data Study Group application

Queries can be directed to the Data Study Group team