Using data science and machine learning to develop cross-disciplinary analytical methods in human evolutionary studies

Research in human evolution has been transformed by genomics and the development of ancient DNA methodologies, which have yielded insights into past demography, dispersal and admixture patterns, social behaviour, selection, disease history, and more. The challenge for the classical palaeosciences is to develop methodologies and data capture that match the richness of genomic data. This project is designed to develop machine learning and data science methodologies for non-genomic data, to provide a more comprehensive and integrated understanding of human evolution.

Explaining the science

Human evolution is a central research area in biology and anthropology, with a history of research going back more than 150 years. For most of that time, evidence has come from digging up fossils and archaeological remains. This remains central, but molecular genetics has revolutionised the field. Evolutionary genomic data are vast in scale and complex in various ways. They are, however, tractable to very sophisticated analytical techniques, especially those developing under the banner of machine learning and data science. Archaeological and fossil data are less amenable: they are complex, variable both in nature and in the format in which they were collected or published, and often lost in older publications.

The methods to be developed in this project will aim to extract much of this data and to systematise it into highly quantifiable and comparable forms. Where currently there is at best a visual or verbal comparison between genomic and non-genomic data in evolutionary science, the aim of this project is to develop methods that allow analytical integration across data domains. This is particularly important because it has become increasingly clear that recent human evolution is a complex process of origination, isolation, dispersal, hybridisation and replacement.

Project aims

Genomics has transformed human evolutionary studies as much as it has other parts of biology. One of the reasons for this impact is the sheer scale of the data now available, and the power of the analytical techniques used. Machine learning and data science have, in effect, swamped traditional approaches to human evolution. However, the palaeosciences (palaeontology, archaeology, earth sciences) have a major role to play, supplying hypotheses, providing contextual information and, above all, providing evidence for the evolution of the phenotype and extended phenotype.

The major challenge is to develop data structures and analytical methods for these aspects that can be integrated with genomics. The aim of this project is to take up this challenge, and to develop methods drawn from machine learning and data science that would greatly enhance the quantity and quantification of the complex data of the palaeosciences: morphometrics of fossils, attributes of the millions of stone tools that reflect hominin behaviour, environmental context and more. The data are held in books, papers and reports, in text, tabular and image form. These will require advanced algorithm-based input methods. Turning these sources into usable data will rest on classifying the features that form the basis of the output data. Methods used will include string-searching algorithms, deep learning and computer vision. The primary output will be a widely applicable protocol and workflow, running from raw archived data to an analysable database, for data relevant to modern human evolution.
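
As a purely illustrative sketch of the string-searching step described above, the following Python fragment pulls artefact measurements out of free text and normalises them to millimetres. It is not the project's method: the pattern, the attribute names, and the names Measurement and extract_measurements are all assumptions made for this example, and real archived reports would need far more robust handling.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical pattern (an assumption for this sketch): an attribute name,
# an optional linking word or symbol, a number, then a unit.
MEASUREMENT = re.compile(
    r"\b(?P<attribute>length|breadth|width|thickness)\s*(?:of|is|was|:|=)?\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mm|cm)\b",
    re.IGNORECASE,
)


@dataclass
class Measurement:
    """One extracted measurement, normalised to millimetres."""
    attribute: str
    value_mm: float


def extract_measurements(text: str) -> List[Measurement]:
    """Scan free text and return every measurement the pattern recognises."""
    records = []
    for match in MEASUREMENT.finditer(text):
        value = float(match.group("value"))
        # Convert centimetres to millimetres so all records share one unit.
        if match.group("unit").lower() == "cm":
            value *= 10.0
        records.append(Measurement(match.group("attribute").lower(), value))
    return records


if __name__ == "__main__":
    sample = "The handaxe has a length of 12.5 cm, breadth 78 mm and thickness: 4 cm."
    for record in extract_measurements(sample):
        print(record.attribute, record.value_mm)
```

Records in a uniform shape like this could then be loaded into a database table alongside site and context information. Tabular and image sources would need different techniques, such as the deep learning and computer vision methods mentioned above.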

The project will form the platform for integrating genomics and palaeo-phenotype data, and so greatly increase the range of analyses possible on the patterns and processes of human evolution. Human evolution is a central problem in biology, both for its intrinsic interest and for its implications for the medical and cognitive sciences and for the relationship between humans and biodiversity overall.


The humanities and historical sciences such as palaeontology and archaeology share the problem that their data are complex, not easily quantified, and scattered across non-uniform sources. Data science and machine learning are being used to improve this situation, partly to advance the fields, and partly to strengthen open access. This project will have applications in this broader endeavour, making human evolutionary data available for analysis across many pure and applied fields of science.