Every new technology needs to be tested to make sure it is safe and effective. AI is no different. But there’s a problem: AI models are advancing so quickly that the tests are struggling to keep up.
Large language models (LLMs), which power chatbots such as ChatGPT and DeepSeek, are already getting near full marks on the most common capability tests. That’s impressive, but it also makes it more difficult to compare their abilities and monitor their breakneck progress.
A new test called Humanity’s Last Exam may provide a solution. Developed by a team at the Center for AI Safety and Scale AI, and consisting of 3,000 challenging questions submitted by researchers around the world (including myself!), the test aims to be the definitive benchmark for measuring LLM capabilities.
Testing, testing
There are lots of different ways to test LLMs, from both a performance and a safety perspective. For example, prior to release, AI developers will assess an LLM’s resistance to being used for malicious purposes. OpenAI has documented how its o1 model was assessed according to how often it complies with requests for harmful content, ‘hallucinates’ inaccurate answers, or makes stereotyped responses.
There are also independent organisations that evaluate LLMs, including METR (Model Evaluation & Threat Research), Apollo Research and the AI Safety Institute. At the Turing, my team has developed a benchmark for evaluating the risk of LLMs being used to autonomously exploit software vulnerabilities.
Often, however, these tests only cover a narrow subject area (ours is specific to cyber security), or include only a small number of tasks. Attempts to create a broader, standardised benchmark for model comparison include the MMLU (Measuring Massive Multitask Language Understanding), which uses around 16,000 multiple-choice questions to test models’ general knowledge and problem-solving abilities.
But the latest LLMs are now achieving over 90% accuracy on benchmarks such as MMLU – partly due to the models’ increasing capabilities (causing ‘benchmark saturation’), but also because many of the tests are openly available online and likely form part of the dataset that was used to train the LLM in the first place.
You may now turn over your papers
Humanity’s Last Exam represents AI models’ toughest test yet. The 3,000 questions, written specifically for the project by domain experts, cover over 100 subjects ranging from classics, ecology and maths to linguistics, chemistry and medicine (my two questions are in the areas of cryptography and computer security).
Most of the questions are kept secret, to prevent the LLMs from scraping the answers off the internet, but here are a couple of examples:
- Here is a representation of a Roman inscription, originally found on a tombstone. Provide a translation for the Palmyrene script. A transliteration of the text is provided: RGYNᵓ BT ḤRY BR ᶜTᵓ ḤBL

- Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
LLMs are scored according to how many questions they get right, and are also asked for their confidence when answering – this could be useful for finding out, for example, whether there’s a correlation between the confidence (or over-confidence) of an LLM and its tendency to create inaccurate outputs (hallucinations).
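To make this concrete, here is a minimal sketch (in Python, and not the exam’s official scoring code) of how accuracy and a crude calibration check could be computed from a set of graded answers; the record format and numbers are invented purely for illustration.

```python
# Minimal sketch: accuracy plus a simple over-confidence check.
# Each record is (was the model's answer correct?, model's stated confidence in %).
# These are made-up values for illustration only.
records = [
    (True, 95),
    (False, 80),
    (False, 60),
    (True, 40),
]

# Fraction of questions answered correctly.
accuracy = sum(correct for correct, _ in records) / len(records)

# Average stated confidence, rescaled to a 0-1 fraction.
mean_confidence = sum(conf for _, conf in records) / len(records) / 100

# Positive gap = the model claims more confidence than its accuracy warrants.
calibration_gap = mean_confidence - accuracy

print(f"accuracy:        {accuracy:.0%}")
print(f"mean confidence: {mean_confidence:.0%}")
print(f"calibration gap: {calibration_gap:+.0%}")
```

A large positive gap between average stated confidence and actual accuracy is one simple signal of over-confidence; more sophisticated calibration metrics exist, but the basic idea of comparing what a model claims to know with what it actually gets right is the same.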
Professor AI?
At the time of going to press, OpenAI’s Deep Research tool (powered by a version of its o3 model) has the highest score (26.6%) on Humanity’s Last Exam, followed by OpenAI’s o3-mini (10.5-13.0%) and DeepSeek’s R1 (9.4%).
According to the exam’s creators, “it is plausible that models could exceed 50% accuracy by the end of 2025”. If that is the case – and it seems likely given that the jump from 9.4% to 26.6% took less than two weeks – it might not be long before models are maxing out this benchmark, too. So will that mean we can say LLMs are as intelligent as human professors?
Not quite. The team is keen to point out that it is testing structured, closed-ended academic problems “rather than open-ended research or creative problem-solving abilities”. Even if an LLM scored 100%, it would not be demonstrating artificial general intelligence (AGI), which implies a level of flexibility and adaptability akin to human cognition.
In my view, an AGI system would need to demonstrate several key capabilities, including high performance across multiple domains and data types, dynamic learning, self-improvement and intuitive reasoning. AI experts have long debated whether and when AGI will happen, but they are increasingly speculating that it could be a matter of decades or even years.
Whatever the future holds, Humanity’s Last Exam provides a useful way to track LLMs’ performance. Models that score highly will still be extremely powerful tools, and I can see them being used by researchers to explore scientific questions that we don’t know the answers to.
There are echoes here of the work of Alan Turing himself. In 1950, he asked “can machines think?”, in a paper that introduced what is now called the Turing test – the most famous AI benchmark of them all (and one that ChatGPT apparently passed in 2024). Seventy-five years on, that question has never felt more relevant.
Top image: adapted from a photo by Akshay Chauhan (Unsplash)