Decade-level Word2Vec models from automatically transcribed 19th-century newspapers digitised by the British Library (1800-1919)

Abstract

Word embeddings trained on a 4.2-billion-word corpus of 19th-century British newspapers using Word2Vec and the following parameters:

  • sg = True
  • min_count = 5
  • window = 5
  • vector_size = 100
  • epochs = 5
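As an illustration only, the parameters above could be passed to gensim's `Word2Vec` class roughly as follows. This is a minimal sketch, not the project's actual training script: the use of gensim, the helper name `train_decade_model`, and the input format (an iterable of tokenised sentences for one decade) are assumptions. In gensim, `sg=1` selects the skip-gram architecture, which corresponds to `sg = True` above.

```python
# Hyperparameters as listed in this record (sg=1 is skip-gram, i.e. sg = True).
W2V_PARAMS = {
    "sg": 1,
    "min_count": 5,
    "window": 5,
    "vector_size": 100,
    "epochs": 5,
}


def train_decade_model(sentences):
    """Train one Word2Vec model on the tokenised sentences of a single decade.

    `sentences` is assumed to be an iterable of token lists, e.g.
    [["the", "railway", "opened"], ...]. Hypothetical helper, for illustration.
    """
    # Imported lazily so the parameter dictionary can be inspected without gensim.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

Calling `train_decade_model` once per ten-year slice of the corpus would yield one independent model per decade.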

The embeddings are divided into periods of ten years each. Unlike those in this repository, these embeddings have not been aligned across decades, and OCR errors have not been removed from the vocabulary.
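Because the decade models are not aligned, their vector spaces are not directly comparable, so cosine similarity between a word's vector in one decade and its vector in another is not meaningful. One workaround that needs no alignment is to compare a word's nearest neighbours within each decade. The sketch below assumes the models can be loaded with gensim's `Word2Vec.load`; the file paths and the helper names are hypothetical, and the actual file format should be checked against the GitHub documentation.

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    """Jaccard overlap between two nearest-neighbour lists (0.0 to 1.0)."""
    set_a, set_b = set(neighbours_a), set(neighbours_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def top_neighbours(model_path, word, topn=10):
    """Return the `topn` nearest neighbours of `word` in one decade model.

    `model_path` is a hypothetical path to a saved gensim Word2Vec model.
    """
    # Imported lazily so neighbour_overlap can be used without gensim.
    from gensim.models import Word2Vec

    model = Word2Vec.load(model_path)
    return [token for token, _score in model.wv.most_similar(word, topn=topn)]


# Usage sketch (paths are placeholders, not real file names):
# early = top_neighbours("path/to/1800s.model", "train")
# late = top_neighbours("path/to/1900s.model", "train")
# print(neighbour_overlap(early, late))
```

A low overlap between two decades can hint at a change in usage, although with unfiltered OCR errors in the vocabulary some neighbours may simply be misspelt variants of the same word.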

See related GitHub repository for the full documentation: https://github.com/Living-with-machines/DiachronicEmb-BigHistData

Project website (Living with Machines): https://livingwithmachines.ac.uk/

Citation information

Nilo Pedrazzini. (2023). Decade-level Word2Vec models from automatically transcribed 19th-century newspapers digitised by the British Library (1800-1919) [Data set].

Turing affiliated authors