DeezyMatch training set for optical character recognition

Abstract

Optical character recognition (OCR) is the process of automatically transcribing text from images. OCR-induced errors in digitised text are a common problem in the digital humanities. They usually arise from misrecognised characters, such as "h" recognised as "b", or "c" recognised as "o". The DeezyMatch library was built to address this issue through fuzzy string matching based on deep neural networks. Training a DeezyMatch model requires a training set of positive and negative string pairs. We present a new dataset of positive and negative OCR variations for training a DeezyMatch model, which can then perform fuzzy string matching for the downstream task of entity linking. The dataset was generated automatically from word2vec embeddings trained on digitised historical news texts, and was expanded with alternate toponym names extracted from Wikipedia.
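
As a rough illustration of what "positive and negative string pairs" means, the sketch below builds a few toy pairs and serialises them as tab-separated lines, one pair per line with a TRUE/FALSE label. The pairs are invented for illustration and are not taken from the dataset. The three-column layout is an assumption about the input DeezyMatch expects and should be checked against the DeezyMatch documentation and the dataset files. The names `toy_pairs`, `to_tsv` and `from_tsv` are hypothetical.

```python
# Toy string pairs, invented for illustration (NOT taken from the dataset).
# Each entry is (query, candidate, is_match).
toy_pairs = [
    ("church", "cburch", True),   # positive: "h" misrecognised as "b"
    ("coach", "ooach", True),     # positive: "c" misrecognised as "o"
    ("church", "chapel", False),  # negative: a different word, not an OCR variant
]


def to_tsv(pairs):
    """Serialise pairs as 'string1<TAB>string2<TAB>TRUE|FALSE' lines (assumed layout)."""
    lines = []
    for query, candidate, is_match in pairs:
        label = "TRUE" if is_match else "FALSE"
        lines.append("\t".join([query, candidate, label]))
    return "\n".join(lines) + "\n"


def from_tsv(text):
    """Parse the assumed three-column layout back into (query, candidate, bool) tuples."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        query, candidate, label = line.split("\t")
        pairs.append((query, candidate, label == "TRUE"))
    return pairs


if __name__ == "__main__":
    print(to_tsv(toy_pairs), end="")
```

Positive pairs teach the model which surface variations refer to the same string, and negative pairs teach it which similar-looking strings should not be matched.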

Citation information

Coll Ardanuy, Mariona, Federico Nanni & Nilo Pedrazzini. 2023. DeezyMatch training set for OCR [Data set]. British Library Research Repository. https://bl.iro.bl.uk/concern/datasets/12208b77-74d6-44b5-88f9-df04db881d63

Turing affiliated authors