Measurement theory for data science and AI

Modelling the skills of learning machines and developing standardised benchmark tests

Just like a human, any AI algorithm has inherent capabilities and limitations which are important to understand. To assess either human or AI skills one has to go beyond what can be directly observed (e.g. exam scores or match results). This project will adapt techniques from psychological skill assessment to model and explain fundamental algorithmic abilities in data science and AI.

Explaining the science

Measurement theory studies the concepts of measurement and scale. If you have a way to measure, say, the length of individual rods or planks, you should also be able to calculate the combined length of concatenated rods or planks. What relevant concatenation operations are there in data science and AI, and what do they imply for the underlying measurement scale?
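To make the point concrete, here is a minimal sketch (with made-up values) of how the scale of a measurement constrains the operations and statistics that are meaningful on it:

```python
# Ratio scale: lengths of rods in centimetres. Concatenating rods
# corresponds to adding their lengths, so sums and means are meaningful.
rod_lengths = [30.0, 45.0, 25.0]
combined_length = sum(rod_lengths)  # length of the concatenated rods

# Ordinal scale: difficulty grades of benchmark tasks. The order is
# meaningful, but there is no concatenation operation ("easy + hard"
# has no interpretation), so the median is an allowable statistic
# while the arithmetic mean is not.
grades = ["easy", "medium", "hard", "medium", "easy"]
order = {"easy": 0, "medium": 1, "hard": 2}
ranked = sorted(grades, key=order.get)
median_grade = ranked[len(ranked) // 2]
```

The same question then arises for, say, classifier accuracies or benchmark scores: which scale are they on, and which summary statistics over them are actually allowable?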

Psychometrics explores the idea that many variables of interest - such as the difficulty of a test or the ability of a student - are latent variables that manifest themselves only indirectly through test results. Luckily, latent variable models are widely used in machine learning and so this will be an area of direct relevance to the project.
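One of the simplest latent variable models in psychometrics is the Rasch model: the probability of a learner answering an item correctly depends only on the difference between a latent ability and a latent item difficulty. A minimal sketch (the parameter values below are purely illustrative):

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Rasch (one-parameter logistic) model:
    P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, success is a coin flip.
p_equal = rasch_probability(ability=1.0, difficulty=1.0)

# An able learner on an easy item succeeds with high probability.
p_easy = rasch_probability(ability=2.0, difficulty=-1.0)
```

Neither ability nor difficulty is observed directly; both are inferred from patterns of test results, which is exactly the kind of latent variable inference machine learning is well equipped for.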

Ultimately, the kinds of conclusions this project wants to draw from its experiments with data are causal: 'this algorithm outperforms that algorithm on this data set because the classes are highly imbalanced'. The underlying reasoning is counterfactual: 'had the classes been balanced, the outcome would have been different'. Causal models are a topic of considerable current interest in machine learning and AI, and so here is a third set of ideas to be tapped into.
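The class imbalance example can be simulated in a few lines. The sketch below (all numbers hypothetical) shows a trivial majority-class predictor looking strong on an imbalanced test set, while the counterfactual balanced test set reveals it is no better than chance:

```python
# A majority-class 'classifier' evaluated on an imbalanced test set.
imbalanced = [0] * 90 + [1] * 10          # 90% negative class
predictions = [0] * len(imbalanced)       # always predict the majority class

accuracy = sum(p == y for p, y in zip(predictions, imbalanced)) / len(imbalanced)
# accuracy is 0.9, which looks impressive...

# ...but on the counterfactual balanced test set the picture changes.
balanced = [0] * 50 + [1] * 50
predictions_b = [0] * len(balanced)
accuracy_b = sum(p == y for p, y in zip(predictions_b, balanced)) / len(balanced)
# accuracy_b is 0.5: no better than a coin flip.
```

The observed score alone cannot distinguish genuine skill from an artefact of the data distribution; that distinction requires a causal model of how the score came about.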

Project aims

An important objective of the project is to develop awareness in the data science and AI community of the importance of measurement scales, and how specific scales have associated allowable operations and statistics.

A second objective is to develop a machine learning equivalent of psychometrics, which might be called discometrics (from the Latin discere, to learn). Just as psychometrics has developed tools to model the skills of a human learner and to build standardised tests such as the SAT, we need similar tools to model the skills of learning machines and to establish standardised benchmarks.

Thirdly, an important set of AI capabilities and skills is associated with privacy, fairness and prevention of discrimination: we want to make sure that AI algorithms take decisions for the right reasons and operate within the confines of the law. Developing measurement procedures and calibrated test suites for these latent skills is hence of particular significance.


The project will most directly affect data science, machine learning and AI methodology, in particular the empirical work needed to demonstrate that AI algorithms do what they are supposed to do.

The outlook is that this work can eventually lead to standardised skill rating scales similar to the well-known Elo rating in chess. That, in turn, could enable performance certificates for AI algorithms, not dissimilar to energy efficiency assessments of buildings and appliances, or food hygiene certificates, and hence be important for users of AI technology, i.e. all of us.
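The Elo update rule is simple enough to sketch: each player's rating moves towards the observed result by an amount proportional to how surprising that result was. The K-factor and starting ratings below are illustrative, not a proposed standard:

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0):
    """Return updated ratings after one game; score_a is 1, 0.5 or 0."""
    expected_a = elo_expected(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# A lower-rated 'algorithm' beating a higher-rated one gains many points,
# and the exchange is zero-sum: what one gains, the other loses.
new_low, new_high = elo_update(1400.0, 1600.0, score_a=1.0)
```

The appeal of such a scheme for AI benchmarking is that ratings are updated from pairwise comparisons on shared tasks, rather than from a single fixed test set.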

Recent updates

A video explaining the main ideas and results of the project can be accessed here.

Other related talks are:
- Discovery Science 2020 keynote
- Classifier Calibration tutorial

A code repository with explanatory notebooks can be accessed here.


Contact info

[email protected]