Machine learning and large cryogenic electron microscopy data sets

Applying machine learning techniques to identify interacting protein molecules in large cryogenic electron microscopy (cryo-EM) data

Project status



Cryo-EM is a technique that allows frozen proteins to be visualised directly. A single session on the microscope produces a huge data set, often up to 8 TB of image data. Powerful, well-developed software exists for analysing data sets of this size when the object of study conforms to a certain set of constraints. However, many harder-to-reach systems do not. The main objective of this project is to create a self-contained image processing tool to facilitate the analysis of these more difficult cases.

Explaining the science

In 2017 the Nobel Prize in Chemistry was awarded to three pioneers who established the technique of cryo-electron microscopy. Briefly, the technique involves rapidly freezing proteins, imaging them in an electron microscope and calculating three-dimensional maps from the two-dimensional images obtained. The high speed of freezing prevents the formation of crystalline ice, which would perturb the molecular structure.

The interaction of electrons with the proteins gives rise to 2D projection images which contain information about the three-dimensional structure. The 3D map is calculated in a similar way to a hospital CT scan, where back-projection methods combine X-ray projections recorded at known scanning angles. In cryo-EM the orientation of each particle is not known in advance, and the images have a low signal-to-noise ratio. Despite this, the image processing techniques of classification, averaging and angle assignment allow the relative orientations of the particles to be determined and a 3D map calculated.
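
The back-projection idea can be illustrated with a toy example. The sketch below is a simplified 2D analogue, not the method used by any particular cryo-EM package: it assumes the projection angles are already known, uses nearest-neighbour binning and applies no filtering. The function names (detector_index, project, back_project) are hypothetical and chosen only for illustration.

```python
import numpy as np


def detector_index(size, theta):
    """Map every pixel of a size x size image to a detector bin for angle theta (radians)."""
    centre = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    # Signed distance of each pixel from the centre, measured along the detector axis.
    t = (x - centre) * np.cos(theta) + (y - centre) * np.sin(theta)
    # The detector must be long enough to cover the image diagonal.
    half = int(np.ceil(size / np.sqrt(2))) + 1
    return np.rint(t).astype(int) + half, 2 * half + 1


def project(image, theta):
    """Sum the image along parallel rays to give a 1D projection at angle theta."""
    idx, n_bins = detector_index(image.shape[0], theta)
    return np.bincount(idx.ravel(), weights=image.ravel(), minlength=n_bins)


def back_project(projections, thetas, size):
    """Smear each 1D projection back across the image plane and average over angles."""
    recon = np.zeros((size, size))
    for proj, theta in zip(projections, thetas):
        idx, _ = detector_index(size, theta)
        recon += proj[idx]  # each pixel receives the value of the bin it projects onto
    return recon / len(thetas)


if __name__ == "__main__":
    # Toy object: a single bright point away from the centre.
    image = np.zeros((33, 33))
    image[10, 20] = 1.0
    thetas = np.linspace(0.0, np.pi, 60, endpoint=False)
    projections = [project(image, theta) for theta in thetas]
    recon = back_project(projections, thetas, 33)
    print(np.unravel_index(np.argmax(recon), recon.shape))
```

In this toy case the brightest pixel of the reconstruction lands back on the original point, because it is the only location where the smeared projections from every angle overlap. Real cryo-EM reconstruction works in 3D and must first estimate the orientation of each noisy particle image.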

Project aims

The aim of this project is to open the field of high-resolution cryo-EM to a large number of new users and to facilitate the analysis of the large data sets that are now routinely being produced.

The project is creating an image processing tool that automatically identifies filamentous proteins in an image and locates the region of interest, an accessory or binding protein. Once the regions of interest have been located, segmentation/boxing of the filament can occur and ‘particles’ can be extracted. These selected particles can then be fed into existing image processing packages.
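
The final extraction step can be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes the regions of interest have already been located by an earlier detection step (not shown) and are supplied as (row, column) centres on a 2D micrograph array. The function name extract_particles and its parameters are hypothetical.

```python
import numpy as np


def extract_particles(micrograph, centres, box_size):
    """Cut a box_size x box_size window around each (row, col) centre.

    Centres whose box would extend past the image edge are skipped.
    Returns a stack of shape (n_kept, box_size, box_size) and the kept centres.
    """
    half = box_size // 2
    n_rows, n_cols = micrograph.shape
    boxes, kept = [], []
    for r, c in centres:
        r0, c0 = int(round(r)) - half, int(round(c)) - half
        r1, c1 = r0 + box_size, c0 + box_size
        if r0 < 0 or c0 < 0 or r1 > n_rows or c1 > n_cols:
            continue  # box would fall partly outside the micrograph
        boxes.append(micrograph[r0:r1, c0:c1])
        kept.append((r, c))
    if boxes:
        stack = np.stack(boxes)
    else:
        stack = np.empty((0, box_size, box_size), dtype=micrograph.dtype)
    return stack, kept


if __name__ == "__main__":
    # Synthetic stand-in for a micrograph; the centres are made-up example values.
    micrograph = np.random.default_rng(0).normal(size=(100, 100))
    stack, kept = extract_particles(micrograph, [(50, 50), (2, 2), (90, 30)], 16)
    print(stack.shape, kept)
```

The resulting stack of boxed 'particles' is the kind of input that existing image processing packages expect, although each package has its own file formats and metadata requirements.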

This software will put high-resolution structural studies within the reach of many new fields of study. Widening access to this fast-moving, data-driven field will broaden the research and discoveries that are possible.

Quicker analysis and more rapid output of high-resolution 3D structures will inform our understanding of the normal function of proteins and of disease mechanisms. High-resolution EM structures are also rapidly becoming an important tool in structure-based drug design, helping to accelerate drug discovery with wide-reaching societal impact.


The software will be designed for use on any filamentous system that has sparsely bound globular accessory proteins. As a result, the number of systems it could be used to interrogate structurally is vast.

The main focus of the work is determining the 3D structure of cardiac thin filaments in order to understand heart disease at the molecular level, where structural information will aid understanding of both cardiovascular disease and the normal function of the heart.

Specifically, the project is looking at the structure of a protein called troponin, sometimes called 'the protein that switches muscle on'. Troponin is the site of a number of mutations that are known to cause hypertrophic cardiomyopathy (HCM), an inherited heart disease that is the leading cause of sudden death in young adults. Many fundamental questions about troponin's structure and function remain to be answered.


Researchers and collaborators