Language models for quantitative science studies

Developing models to represent the use of language in research and improve performance on specific tasks, including the retrieval and summarisation of literature


This project addresses the problem of understanding how language is used in different scholarly communities. Its approach is to apply deep learning methods to very large datasets of publications. The resulting language models allow for the comparison of how different communities discuss and communicate research findings. The models will also be tested on specific machine learning tasks, such as finding and ranking, or summarising, scholarly literature for a given query.

The project is funded by the Centre for Science and Technology Studies (CWTS), Leiden University (NL).

Explaining the science

The project aims to use unsupervised deep learning approaches to learn high-dimensional representations (known as embeddings) of how words, sentences and whole documents are used. The geometric properties of these embeddings can then be explored to reveal specific characteristics of language as used within a community or by an author, such as the presence of synonyms or polysemous words.
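To make the idea of exploring embeddings geometrically more concrete, here is a minimal sketch in Python. It is not the project's method: the vectors are hand-made toy values rather than learned embeddings, and the names `cosine_similarity`, `nearest_neighbours` and `toy_embeddings` are hypothetical, introduced only for illustration. It shows one common geometric probe, cosine similarity, under the assumption that words used in similar ways end up with vectors pointing in similar directions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to that of `word`."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Assumption: toy 3-dimensional vectors made up for this example.
# Learned embeddings typically have hundreds of dimensions.
toy_embeddings = {
    "paper": [0.9, 0.1, 0.0],
    "article": [0.8, 0.2, 0.1],
    "monograph": [0.6, 0.5, 0.1],
    "enzyme": [0.0, 0.1, 0.9],
}

neighbours = nearest_neighbours("paper", toy_embeddings, k=2)
for word, score in neighbours:
    print(f"{word}: {score:.3f}")
```

In this toy setting, near-synonyms such as "paper" and "article" rank as closest neighbours. Running the same kind of query against embeddings trained separately on two communities' publications would be one possible way to compare how each community uses a given word.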

Project aims

The main goal of the project is to learn scalable language models for different scholarly communities, each represented as a collection of publications or authors. How different disciplines communicate their research results in published literature is still poorly understood. Crucially, these results are conveyed as text, yet until now citations have been the most common form of data used in science studies.

This project aims to address this gap by modelling large datasets of literature using novel unsupervised deep learning approaches. Language models can be used both to understand how language use varies and for applications such as information retrieval and the summarisation of literature. The project also has the general goal of developing a focus area on science studies and applications within the Turing Data Science for Science program, in collaboration with the Centre for Science and Technology Studies (CWTS) at Leiden University.


The immediate application of the project outcomes is to explore how language is used in different disciplines and scholarly communities, with a specific focus on comparing the humanities and the sciences. The practical application will be a contribution to open challenges in scientific information retrieval and the summarisation of scholarly literature.
