When a crime occurs, crucial information about the event is captured in the narrative write-up. This information is not currently analysed systematically because of its scale, but modern natural language processing techniques, such as topic modelling, offer a rapid, scalable means of identifying insights. This project has developed an approach that uses 'latent Dirichlet allocation' to cluster crime reports by their narrative text. The resulting clusters highlight specific modus operandi and can be visualised through a dashboard application.

Explaining the science

The principal algorithm used in this project is called 'latent Dirichlet allocation' (LDA). This is a generative statistical model that is used in topic modelling to identify latent topics from text documents based on word co-occurrence.

LDA assumes that each document within a corpus (a collection of documents) is a mixture of topics and that each topic is a mixture of specific words. This is intuitive if we think of a text document such as a newspaper article: it may describe an overarching theme, such as Brexit, but each paragraph may relate to a different facet of that theme, such as trade, negotiations or legalities. Each of these is a latent topic within the article, and LDA distinguishes them through the co-occurrence of specific words within those sections.
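The two assumptions above can be made concrete by running LDA's generative story forwards. The following is a minimal sketch in Python using only the standard library; it is not the code used in this project. The vocabulary, the number of topics and the Dirichlet parameters are hypothetical values chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, n_words, rng):
    """Generate one document under LDA's assumptions.

    Returns the document's topic mixture and a list of (topic, word) pairs.
    """
    theta = sample_dirichlet(doc_alpha, rng)  # this document's mixture of topics
    topics = range(len(topic_word))
    tokens = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]         # choose a topic for this word
        w = rng.choices(vocab, weights=topic_word[z])[0]  # choose a word from that topic
        tokens.append((z, w))
    return theta, tokens

# Hypothetical vocabulary, loosely echoing the newspaper example (illustrative only)
VOCAB = ["tariff", "export", "summit", "talks", "treaty", "court"]

rng = random.Random(0)
# Each topic is a distribution over the vocabulary (assumed: 3 topics, sparse prior)
TOPIC_WORD = [sample_dirichlet([0.5] * len(VOCAB), rng) for _ in range(3)]
theta, tokens = generate_document(TOPIC_WORD, VOCAB, [0.3] * 3, 12, rng)

print("topic mixture:", [round(p, 2) for p in theta])
print("words:", " ".join(w for _, w in tokens))
```

Fitting LDA to real text runs this story in reverse: given only the observed words, it estimates the topic mixture of each document and the word mixture of each topic that most plausibly produced them.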

This project dealt with sensitive text information captured by police officers at crime scenes. The data was prepared by Safer Leeds to meet their own internal quality standards and was limited to the first 18 lines of narrative text, which are also disclosable in court.

Figure: Screenshot of the dashboard application produced in this project, showing the geographical distribution of specific crime topics on a map of Leeds.


Project aims

This project tested a variety of natural language processing methods to identify insights that could inform organisational operations and strategic planning for police and community safety partnerships.

This project hoped to develop a more systematic approach to analysing the vast amounts of free text data captured by the police.

Success was defined as demonstrating to project partners at Safer Leeds, through a proof of concept, that useful insights could be derived from their text data. Such efforts are crucially important at a time when police services work in a constrained budgetary environment, because they can provide highly specific insights to aid crime prevention.


This work was applied to data from West Yorkshire Police but could easily be extended to other police forces and to public agencies that capture significant amounts of text data, such as probation services and child and social services.

Recent updates

June 2019

This project has now been completed as part of the LIDA internship, but work is ongoing with Safer Leeds to deploy much of the code onto their systems for regular use.


Researchers and collaborators

Contact info

[email protected]