Many crop species have multiple copies of their genome, and each copy of a gene can develop mutations that lead it to be expressed in different tissues and at different developmental stages. This project is developing machine learning tools to predict where and when each gene copy is expressed. This will allow the researchers to transfer experimentally validated models from well-studied species with simple genomes (such as Arabidopsis or 'thale cress') into important crop species with multiple copies of their genomes (such as broccoli, cabbage or cauliflower).

Explaining the science

This project uses a mixture of supervised and unsupervised learning techniques. The main difficulty is that multiple gene copies can have very similar regulatory sequences but very different expression patterns. A further challenge is balancing the need for interpretable models against the need for highly predictive ones.

Project aims

The aims of the project are to:

  • Develop models to predict where regulatory proteins bind to DNA and use these models to annotate new genomes from the Brassica family.
  • Analyse the evolution of the regulatory sequences that control the expression of genes in each genome copy, using unsupervised learning techniques.
  • Develop models to predict the gene expression pattern (spatial and temporal profiles) of all gene copies found in members of the Brassica family. These models will take as input either (a) the raw DNA sequence of regulatory sequences or (b) the locations where regulatory proteins are predicted to bind to DNA.  
  • Analyse how transferable each model is across different species in the Brassica family.
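As a rough illustration of the two kinds of model input named in the third aim, the sketch below turns (a) a raw DNA sequence into a one-hot matrix and (b) a list of predicted binding sites into a count per regulatory protein. It is a minimal sketch under stated assumptions, not the project's actual pipeline: the function names, the factor names (TF_A, TF_B, TF_C) and the example sequence are all hypothetical.

```python
# Illustrative only: helper names, factor names and inputs are hypothetical,
# not taken from the project described here.

BASES = "ACGT"


def one_hot_encode(sequence):
    """Encode a DNA sequence as one row per base, with columns A, C, G, T.

    Bases outside A/C/G/T (for example the ambiguity code N) become all zeros.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


def binding_site_features(sites, factors):
    """Count predicted binding sites per regulatory protein.

    sites   -- list of (factor_name, position) tuples from some upstream predictor
    factors -- ordered list of factor names defining the feature vector layout
    """
    counts = {name: 0 for name in factors}
    for name, _position in sites:
        if name in counts:
            counts[name] += 1
    return [counts[name] for name in factors]


# Input type (a): raw regulatory sequence.
encoded = one_hot_encode("ACGTN")

# Input type (b): locations of predicted binding sites.
features = binding_site_features(
    [("TF_A", 12), ("TF_B", 40), ("TF_A", 95)],
    ["TF_A", "TF_B", "TF_C"],
)
```

Either representation could then be passed to a supervised model. The count-based features are easier to interpret, while the one-hot matrix keeps the full sequence and leaves it to the model to find the relevant patterns.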


This work is important for agriculturally relevant plants such as cabbage, broccoli, cauliflower, turnips and Brussels sprouts, which are part of the Brassica family. Brassicas are closely related to Arabidopsis ('thale cress'), a well-studied model plant species.

In Arabidopsis, we know the function of many genes which might be good targets for directed breeding programs to increase yield or resilience to environmental stress. However, Brassicas have many copies of each gene, so it is unclear which of these gene copies should be the primary target of directed breeding initiatives. If we can predict when and where each gene copy is expressed, we can identify which gene copy is more likely to be relevant for breeders.

For instance, let's say we know a gene in Arabidopsis that helps control how much water is released from leaves, affecting the plant's ability to withstand drought. A crop might have four copies of this gene, but it may be that only one copy is expressed in adult leaf cells; that specific gene copy might be the most promising target for breeders.

