Machines reading maps

Creating a generalisable machine learning pipeline to process text on maps and catalysing humanities, scientific, and cultural heritage communities to use map text as data


The right maps can be hard to find. We envision a future where map collections can be searched based on their spatial content, similar to the way that digitised newspaper collections enable full-text searching across scanned pages. This project contributes to reversing the fortunes of historic maps at the moment when many of them are being made available online.

Header image courtesy of the National Library of Scotland

Explaining the science

The project will refine an already robust tool for extracting text from maps (Strabo), developed by Chiang and colleagues on the Linked Maps project. Two open challenges are at the heart of this project: detecting text on scanned maps and making that text meaningful through linking with external knowledge bases that define semantic types (e.g., mountains, towns, roads). The first task, ‘optical character recognition (OCR) for maps’, has been the focus of Strabo and will be tested here on new collections, with the aim of improving its ability to generate complete text phrases and of documenting fine-tuning practices. The second task involves three steps: a) gazetteer mapping, or matching text strings to existing Linked Open Data gazetteers; b) classifying map text per historically-appropriate and research-driven semantic categories; and c) further linking enriched map text to other data with geospatial attributes. We will examine labels (town names, footpaths, mines, etc.) as well as text outside the neatline, or collar, of the map (title, surveyors, or print dates, for example).
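The three enrichment steps above can be sketched as a minimal pipeline. Everything here is an illustrative assumption, not the project's actual data or code: the tiny in-memory gazetteer, the semantic types, and the `example.org` URIs stand in for real Linked Open Data resources.

```python
# Minimal sketch of step (a) gazetteer matching and step (b) semantic
# classification for an extracted map label. The gazetteer entries and
# URIs below are invented placeholders for real Linked Open Data.

GAZETTEER = {
    "ben nevis": {"type": "mountain", "uri": "https://example.org/place/ben-nevis"},
    "fort william": {"type": "town", "uri": "https://example.org/place/fort-william"},
}

def enrich_label(label: str) -> dict:
    """Match an extracted map label against the gazetteer and return it
    with a semantic type and a linked identifier (or 'unknown' if unmatched)."""
    entry = GAZETTEER.get(label.strip().lower())
    if entry is None:
        return {"label": label, "type": "unknown", "uri": None}
    return {"label": label, "type": entry["type"], "uri": entry["uri"]}

print(enrich_label("Ben Nevis"))
# -> {'label': 'Ben Nevis', 'type': 'mountain', 'uri': 'https://example.org/place/ben-nevis'}
```

In practice the matching step would handle OCR noise (fuzzy matching, historical spelling variants) rather than exact lookups, and the gazetteer would be queried from a service such as Wikidata rather than held in memory.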

Combining Strabo with entity linking and image annotation services will enable swift analysis of content without sacrificing the context of the mapping process. Curating map text as open datasets and linking it to external knowledge bases (e.g., Wikidata or gazetteers) enables tasks like improving historical gazetteers or documenting place name histories. In turn, it supports complex queries for finding and indexing historical maps, such as retrieving all historical maps naming mountain peaks higher than 1,000 meters in California, or finding areas with a 1-mile radius that contain five churches built before 1800, plus a train station.
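Once map labels are linked to gazetteer entries carrying attributes such as elevation, queries like the peak example become simple filters over the enriched data. The records below are invented sample data, a sketch of the idea rather than the project's actual schema:

```python
# Illustration of the kind of query that linked map text enables:
# find map sheets whose labels include mountain peaks above 1,000 m.
# The records are invented sample data joining a label to its sheet
# and to gazetteer attributes (semantic type, elevation).

records = [
    {"map_id": "sheet-001", "label": "Mt Example", "type": "mountain", "elevation_m": 1200},
    {"map_id": "sheet-002", "label": "Low Hill", "type": "mountain", "elevation_m": 400},
    {"map_id": "sheet-002", "label": "Rivertown", "type": "town", "elevation_m": None},
]

tall_peak_maps = sorted({
    r["map_id"]
    for r in records
    if r["type"] == "mountain" and (r["elevation_m"] or 0) > 1000
})
print(tall_peak_maps)  # -> ['sheet-001']
```

At collection scale the same filter would be expressed against a triple store or database (e.g., as a SPARQL query over the linked data) rather than an in-memory list.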

Recognising the need for correction and curation of map text extracted through automated means, the project (in collaboration with another Turing project: Living with Machines) is prototyping methods for annotating text on maps using Recogito, the award-winning online environment for collaborative annotation. Recogito is the benchmark for text and image annotation platforms in the digital humanities, and this project supports its ongoing development and strategic roadmap through collaboration with Rainer Simon.

Project aims

1. Read map content at scale using tools for text, not images.
2. Integrate place entity linking and image annotation tools to make text on maps meaningful. 
3. Improve map discovery and collection histories at cultural institutions.
4. Analyse text on maps.

The project aims to change the way that humanists and heritage professionals interact with map images. Maps constitute a significant body of global cultural heritage, and they are being scanned at a rapid pace in the US and UK. However, most critical investigation of maps continues on a small scale, through close ‘readings’ of a few maps. 

Individual maps communicate through visual grammars, supplemented by text. But text on maps is an almost entirely untapped source for understanding how knowledge of place is constructed. Investigating map content at scale can teach us about what has been preserved and omitted in the cartographic record. Such knowledge is a key starting point for understanding why using map text to enrich collection metadata may be advisable (when collection records contain no geographic or locational information, or only the most superficial) or potentially harmful (when map text replicates colonial power structures).

MRM will enable researchers and cultural institutions to generate and analyse this data, contributing to metadata creation and decolonisation efforts, and enhancing map accessibility and discoverability.


By predicting what type of content text on maps represents (buildings, mountains, etc.) and linking to gazetteers (indexes of places and related metadata, like locations), we unlock the potential for users to find and interpret maps by the thousands. 

Extracting meaningful data from a range of scanned collections (not only maps, but any visual resource with printed text related to place, such as posters) will accelerate spatiotemporal-driven discovery of primary sources in the humanities. Furthermore, map text can be analysed in its own right, reshaping how researchers in any discipline interact with maps.

Cultural institutions can feed map text data back into their work to study the geographical coverage of their collections, or investigate differences between existing metadata and reported locations of map labels. On a sheet-by-sheet basis, for example, MARC fields for subjects and topics can be enriched by map text. After processing US and UK maps and linking them to historical gazetteers, we will test linking UK map labels to Trade Directories, and matching Sanborn map data to US census records, making a significant contribution to both British and American digitised historical data. Such research test cases exemplify the versatility of map labels as primary sources.
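The sheet-by-sheet enrichment mentioned above can be sketched as follows. The simplified record layout, the `subjects_651` key (named after the MARC geographic subject field), and the label data are all hypothetical, standing in for a real catalogue record and a real extraction output:

```python
# Hedged sketch: appending place labels extracted from a map sheet to a
# catalogue record's geographic subject entries (MARC 651-style).
# The record layout and labels are simplified illustrations.

def enrich_subjects(record: dict, map_labels: list[dict]) -> dict:
    """Add place-type labels found on the sheet to the record's
    geographic subjects, skipping non-places and duplicates."""
    subjects = list(record.get("subjects_651", []))
    for lab in map_labels:
        if lab["type"] in {"town", "mountain"} and lab["text"] not in subjects:
            subjects.append(lab["text"])
    return {**record, "subjects_651": subjects}

record = {"title": "Example County sheet 4", "subjects_651": ["Example County"]}
labels = [
    {"text": "Rivertown", "type": "town"},
    {"text": "Mill Road", "type": "road"},  # not a place subject; skipped
]
print(enrich_subjects(record, labels)["subjects_651"])
# -> ['Example County', 'Rivertown']
```

A production workflow would of course write through a proper MARC library and keep provenance for each added subject, so that automatically derived entries remain distinguishable from curated ones.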

Map text data can be used as training data for future GIScience/GeoAI machine learning tools for automatic map understanding. Our planned Kaggle competition, for example, will publicise the gold standard data created during the project and set in motion ongoing development of machine learning models to improve the accuracy and precision of map text data creation.


Researchers and collaborators