Introduction
The goal of the collaboration was to port text analysis codes to the Turing's deployment of a Cray Urika-GX analytics system in order to exercise its data access, transfer and analysis services. Two data sets of interest to the University of Edinburgh's College of Arts, Humanities and Social Sciences (CAHSS), both hosted within the University of Edinburgh, were used: British Library digitised newspapers from the 18th to the early 20th century; and British Library digitised books from the same period. Once ported, the text analysis codes were extended to support new queries across the datasets.
Project aims
Two text analysis codes were used, one for querying each dataset. Both were initially developed by UCL with the British Library in 2015-2016. UCL's codes are written in Python and run queries via Apache Spark. They were originally designed to run queries on a user's local machine or on UCL's high performance computing (HPC) services.
To run the codes within Urika, both were modified so that they no longer depend on UCL's local environment and instead access data located within Urika. An older version of UCL's code, which pre-dates the use of Apache Spark and uses the Message Passing Interface (MPI) for parallel programming, was also run on Urika to generate sample results for the queries that were then migrated to Spark.
New queries were implemented at CAHSS's request to search for occurrences of keywords (e.g. "Krakatoa" or "Krakatua") and their concordances (the text within which the words are found), and for co-located words (e.g. "stranger" and "danger"). Support for an additional dataset, New Zealand newspapers, was also added.
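As an illustration only, the keyword, concordance and co-location queries described above can be sketched in plain Python. This is not the project's code: the function names (`tokenize`, `concordances`, `colocated`), the word-window definition of a concordance and the sample text are all assumptions made for this example. In the real codes, logic of this kind would be applied across the documents of a dataset via Apache Spark rather than to a single string.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def concordances(text, keywords, window=3):
    """Return (keyword, context) pairs for each keyword occurrence.

    The context is the keyword plus up to `window` words on either side.
    (Hypothetical definition; the project may delimit concordances differently.)
    """
    words = tokenize(text)
    targets = {k.lower() for k in keywords}
    results = []
    for i, word in enumerate(words):
        if word in targets:
            start = max(0, i - window)
            results.append((word, " ".join(words[start:i + window + 1])))
    return results


def colocated(text, first, second, window=5):
    """Count occurrences of `first` that have `second` within `window` words."""
    words = tokenize(text)
    first, second = first.lower(), second.lower()
    count = 0
    for i, word in enumerate(words):
        if word == first:
            nearby = words[max(0, i - window):i + window + 1]
            if second in nearby:
                count += 1
    return count


# Made-up sample text, for illustration only.
sample = ("The eruption of Krakatoa was heard far away. "
          "Reports of Krakatua reached London.")
hits = concordances(sample, ["Krakatoa", "Krakatua"], window=2)
pairs = colocated("Beware, stranger: danger lies ahead.", "stranger", "danger")
```

Matching on lower-cased tokens keeps spelling variants such as "Krakatoa" and "Krakatua" as separate keywords, so each variant has to be listed explicitly in the query.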
Visualisations were also developed in Jupyter notebooks to present query results as N-grams, graphs of occurrences by year, and word clouds.
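A graph of occurrences by year needs the query results aggregated per year first. The sketch below shows one way that aggregation could look in plain Python. The function name `occurrences_by_year`, the `(year, text)` record layout and the sample records are assumptions for this example, not the format produced by the project's queries.

```python
import re
from collections import Counter, defaultdict


def occurrences_by_year(records, keywords):
    """Count keyword occurrences per year.

    `records` is an iterable of (year, text) pairs (a hypothetical layout).
    Returns a dict mapping each year with at least one match to a
    {keyword: count} dict, with years in ascending order.
    """
    targets = {k.lower() for k in keywords}
    counts = defaultdict(Counter)
    for year, text in records:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in targets:
                counts[year][word] += 1
    return {year: dict(counts[year]) for year in sorted(counts)}


# Made-up records, for illustration only.
records = [
    (1884, "Strange sunsets followed the Krakatua eruption."),
    (1883, "Krakatoa erupted. Krakatoa was destroyed."),
    (1883, "News of Krakatua arrived by telegraph."),
]
by_year = occurrences_by_year(records, ["Krakatoa", "Krakatua"])
```

The resulting mapping can be passed to a plotting library in a Jupyter notebook, with years on the x-axis and one line per keyword.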
Applications
- The codes are now being extended to help CAHSS understand female emigration between 1850 and 1914.
- The work is now being refactored for use in the Living with Machines project, of which the Turing is also a partner.
- The work is applicable to any domain that wants to perform rich text searches across large volumes of historical documents that have been scanned into, or manually entered into, a machine-readable format.
Recent updates
August 2019
Analysing historical newspapers and books using Apache Spark and Cray Urika-GX, EPCC blog post, 16 August 2019
January 2019
- defoe, a refactoring of the books and newspapers codes into a single tool: GitHub repo
- defoe_visualisations, a complementary collection of Jupyter visualisations for presenting results of queries run via defoe: GitHub repo
December 2018
Analysing humanities data using Cray Urika-GX, EPCC blog post, 11 December 2018.
Organisers
Researchers and collaborators
Dr Rosa Filgueira
Data Architect, EPCC
Dr Michael Jackson
Software Architect, EPCC
Dr Anna Roubíčková
Applications Developer, EPCC
Contact info
Rosa Filgueira
[email protected]