Introduction
Foundation models such as ChatGPT are large machine learning models, trained on huge and broad data sets, that are designed to perform a wide array of tasks. These models have demonstrated remarkable proficiency in generating realistic natural language, and show some ability to solve problems and perform common-sense reasoning. However, the methods currently used to evaluate these models have a number of limitations that restrict our understanding of their capabilities.
This project aims to develop new methods for evaluating the strengths and weaknesses of foundation models, so that researchers and policymakers can make informed decisions about the safety and utility of these models. Our focus will be to benchmark such models and to delineate the limits of their capabilities: although they appear very capable in some respects, they can fail on apparently simple tasks in unpredictable ways.
Explaining the science
Foundation models are an important emerging class of artificial intelligence (AI) systems, characterised by the use of very large machine learning models, trained on extremely large and broad data sets, and requiring considerable compute resources during training. Large language models (LLMs) such as OpenAI’s GPT-3 and GPT-4 and Google’s Bard and LaMDA are the best-known examples of foundation models. These models have attracted considerable attention for their ability to generate realistic natural language text and engage in sustained and frequently coherent natural language dialogues.
They have also demonstrated some capabilities in other domains, such as common-sense reasoning and problem-solving. A key hypothesis underlying work on foundation models is that they acquire competence in a broad range of tasks, which can then be specialised with further training for specific applications. Foundation models are already finding innovative applications, such as GitHub’s Copilot system, which can generate computer code from natural language descriptions (“a Python function to find all the prime numbers in a list”).
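As an illustration, a function of the kind such a prompt might produce could look like the following. This is a minimal sketch, not Copilot’s actual output:

```python
def find_primes(numbers):
    """Return the prime numbers found in a list of integers."""
    def is_prime(n):
        if n < 2:
            return False
        for d in range(2, int(n ** 0.5) + 1):
            if n % d == 0:
                return False
        return True
    return [n for n in numbers if is_prime(n)]

print(find_primes([2, 3, 4, 5, 15, 17, 20]))  # -> [2, 3, 5, 17]
```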
This project will not develop new foundation models, but will instead explore how best to evaluate and benchmark them, for example using synthetic worlds and scenarios. We will also work to address several limitations of current benchmarking methods, such as the rapid pace at which benchmarks become obsolete.
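To make the idea concrete, an automated benchmark over synthetic scenarios could be structured roughly as follows. This is a minimal sketch: the scenario items, the query_model stub, and the keyword-based scoring are hypothetical placeholders, not the project’s actual benchmarks or methodology:

```python
# Minimal sketch of an evaluation loop over synthetic scenario items.
scenarios = [
    {"prompt": "A glass is pushed off the edge of a table. What happens next?",
     "expected_keywords": ["fall", "break", "floor"]},
    {"prompt": "Anna puts her keys in a drawer and leaves. Ben moves the keys "
               "to a shelf. Where will Anna look for her keys first?",
     "expected_keywords": ["drawer"]},
]

def query_model(prompt: str) -> str:
    # Placeholder: replace with a call to the model under evaluation.
    return "The glass will probably fall to the floor and may break."

def passes(response: str, keywords: list[str]) -> bool:
    # Crude keyword check; a real benchmark would need more careful scoring.
    return any(k.lower() in response.lower() for k in keywords)

results = [passes(query_model(s["prompt"]), s["expected_keywords"]) for s in scenarios]
print(f"Passed {sum(results)} of {len(results)} scenarios")
```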
Project aims
We will focus on addressing three key questions:
- Learned values. We will investigate the extent to which foundation models learn human values. To what extent can a model be said to “understand” and reflect the human values raised in evaluation scenarios? Are models consistent in applying such values? Do they hold such values immutably, or are the values malleable, depending on the prompts given to the model?
- Common-sense reasoning. Large language models such as GPT appear to have some common-sense reasoning capabilities, but the boundaries of these capabilities remain unclear. We therefore aim to develop robust methods to evaluate the extent to which LLMs understand the physical, causal, spatial, temporal, and social rules that govern the world.
- Theory of mind. Much human reasoning is social, involving the beliefs and aspirations of others. To what extent can this capability be acquired by training on textual data sets?
Applications
The term “foundation model” reflects the expectation that these models can serve as the foundation for many different kinds of tasks and applications. We will not be developing applications per se in this project, but we expect our results to inform decisions about which application areas are most appropriate given the current capabilities of foundation models.
Recent updates
Mike Wooldridge organised a one-day Turing symposium at the IET in February 2023; the speakers included two of the project investigators, Mike Wooldridge and Tony Cohn. A recording of the event is available. Watch this space for future events.