Developing computational methods for identifying the emergence of new word meanings using social media data, advancing the understanding of cultural and linguistic interaction online, and improving natural language processing tools.
Students on the project: Farhana Ferdousi Liza and Philippa Shoemark
Barbara McGillivray, Turing Research Fellow, University of Cambridge
Dong Nguyen, Turing Research Fellow, University of Edinburgh
Scott Hale, Turing Fellow, University of Oxford
This project aims to develop a system for identifying new word meanings as they emerge in language, focusing on words entering English from other languages and on changes in their polarity (e.g., from neutral to negative or offensive). An example is the word kaffir, which started with a neutral meaning and has since acquired an offensive use as a racial or religious insult. The proposed research advances the state of the art in Natural Language Processing (NLP) by developing better tools for the semantic processing of language data, and it bears on important social science questions.
Language evolves constantly through social interactions. New words appear, others become obsolete, and others acquire new meanings. Social scientists and linguists are interested in investigating the mechanisms driving these changes. For instance, analysing the meaning of loanwords from foreign languages using social media data helps us understand the precise sense of what is communicated, how people interact online, and the extent to which social media facilitate cross-cultural exchanges. In the case of offensive language, understanding the mechanisms by which it is propagated can inform the design of collaborative online platforms and provide recommendations to limit offensive language where this is desired.
Detecting new meanings of words is also crucial for improving the accuracy of NLP tools in downstream tasks, for example when estimating the “polarity” of words in sentiment analysis (e.g. sick has recently acquired a positive meaning of ‘excellent’ alongside the original meaning of ‘ill’). Work to date has mostly focused on changes over longer time periods (see, e.g., Hamilton et al. 2016). For instance, awful was a synonym of ‘solemn’ in texts from the 1850s, whereas nowadays it stands for ‘terrible’.
New data on language use and new data science methods allow for studying this change at finer timescales and higher resolutions. In addition to social media, online collaborative dictionaries like Urban Dictionary are excellent sources for studying language change as it happens; they are constantly updated and the threshold for including new material is lower than for traditional dictionaries.
In state-of-the-art NLP algorithms, the meaning of a word is often represented by a vector in a low-dimensional space, where geometric closeness stands for semantic similarity. These vectors are usually fed into neural architectures built for specific tasks. The proposed project aims to capture meaning change on a fine-grained, short time scale. We will use the algorithm developed by Hamilton et al. (2016), who applied it to Google Books data to identify new meanings. We will train in-house vectors on multilingual Twitter data collected from 2011 to 2017.
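The idea that geometric closeness stands for semantic similarity can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the toy three-dimensional vectors and the two example years are assumptions made for illustration only. The sketch also assumes that the vectors for the two periods already live in a shared space; separately trained embedding spaces are not directly comparable, so an alignment step (not shown) would be needed in practice. Words whose vectors differ most between the periods are ranked first as meaning change candidates.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_change_candidates(early, late):
    """Rank words present in both periods by cosine distance, largest first.

    `early` and `late` map each word to its vector in one time period.
    Both sets of vectors are assumed to be in the same (aligned) space.
    """
    shared = early.keys() & late.keys()
    scores = {w: 1.0 - cosine_similarity(early[w], late[w]) for w in shared}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up vectors (hypothetical values, not trained on any data).
vectors_2011 = {"sick": [0.9, 0.1, 0.0], "table": [0.0, 1.0, 0.2]}
vectors_2017 = {"sick": [0.3, 0.1, 0.9], "table": [0.0, 1.0, 0.25]}

for word, distance in rank_change_candidates(vectors_2011, vectors_2017):
    print(f"{word}: {distance:.3f}")
```

In this toy setup the vector for sick moves much further between the two periods than the vector for table, so sick is ranked first. With real embeddings, frequency effects and training noise also move vectors, so a high distance is only a candidate signal, not evidence of meaning change by itself.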
Through this process we will identify meaning change candidates and evaluate them against the dictionary data, analysing the factors that drive foreign words to enter the English language and to change their polarity. In doing so, we will shed light on the extent to which the detected meaning changes are driven by language-internal rather than external (e.g. social or technological) factors.
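One simple way to score a ranked list of candidates against dictionary data is precision at k: the share of the top k candidates that the dictionary also records with a new sense. The sketch below is an assumption about how such a check could look, not the project's evaluation protocol. The function name, the candidate list and the set of attested words are all hypothetical.

```python
def precision_at_k(ranked_candidates, attested_words, k):
    """Fraction of the top-k candidates that appear in `attested_words`."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_candidates[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for word in top_k if word in attested_words)
    return hits / len(top_k)


# Hypothetical inputs: candidates ordered from most to least changed, and
# words for which a dictionary lists a newly attested sense.
candidates = ["sick", "table", "lit", "window"]
dictionary_attested = {"sick", "lit"}

print(precision_at_k(candidates, dictionary_attested, 2))
```

Precision at k only checks whether a candidate is attested at all. It says nothing about whether the detected change matches the sense recorded in the dictionary, which would need manual inspection.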