Defining data science and AI

There is a lot of jargon in data science and AI. We’ve created this glossary for non-specialists who want to find out more about these topics without the technical language. We also hope that it will be a useful resource for journalists and policy makers, as well as researchers in areas that intersect with data science and AI. This is an ongoing project, so we will regularly be reviewing the list of terms and definitions.

  • Algorithm

    A sequence of rules that a computer uses to complete a task. An algorithm takes an input (e.g. a dataset) and generates an output (e.g. a pattern that it has found in the data). Algorithms underpin the technology that makes our lives tick, from smartphones and social media to sat nav and online dating, and they are increasingly being used to make predictions and support decisions in areas as diverse as healthcare, employment, insurance and law.
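    To make this concrete, here is a minimal sketch (in Python) of an algorithm that takes a list of numbers as its input and outputs the largest one; the numbers are invented for illustration:

```python
def largest(numbers):
    """A simple algorithm: take a list of numbers (the input)
    and return the biggest one (the output)."""
    biggest = numbers[0]
    for n in numbers[1:]:      # examine each remaining number in turn
        if n > biggest:        # rule: keep whichever is larger
            biggest = n
    return biggest

print(largest([3, 41, 7, 19]))  # → 41
```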

  • Algorithmic bias

    Unfairness that can arise from problems with an algorithm’s process or the way the algorithm is implemented, resulting in the algorithm inappropriately privileging or disadvantaging one group of users over another group. Algorithmic biases often result from biases in the data that has been used to train the algorithm, which can lead to the reinforcement of systemic prejudices around race, gender, sexuality, disability or ethnicity.

  • Artificial intelligence (AI)

    The design and study of machines that can perform tasks that would previously have required human (or other biological) brainpower to accomplish. AI is a broad field that incorporates many different aspects of intelligence, such as reasoning, making decisions, learning from mistakes, communicating, solving problems, and moving around the physical world. AI was founded as an academic discipline in the mid-1950s, and is now found in myriad everyday applications, including virtual assistants, search engines, navigation apps and online banking.

  • Big data

    A wide-ranging field of research that deals with large datasets. The field has grown rapidly over the past couple of decades as computer systems have become capable of storing and analysing the vast amounts of data increasingly being collected about our lives and our planet. A key challenge in big data is working out how to generate useful insights from the data without inappropriately compromising the privacy of the people to whom the data relates.

  • Chatbot

    A software application that has been designed to mimic human conversation, allowing it to talk to users via text or speech. Chatbots are mostly used as virtual assistants in customer service, but there are also chatbot therapists and even chatbot politicians.

  • Computer vision

    A field of research that uses computers to obtain useful information from digital images or videos. Applications include object recognition (e.g. identifying animal species in photographs), facial recognition (smart passport checkers), medical imaging (spotting tumours in scans), navigation (self-driving cars) and video surveillance (monitoring crowd levels at events).

  • Data science

    An umbrella term for any field of research that involves the processing of large amounts of data in order to provide insights into real-world problems. Data scientists are a diverse tribe, ranging from engineers, medics and climatologists to ethicists, economists and linguists.

  • Data-centric engineering

    A field of research that applies data science techniques to engineering problems. It often involves collecting copious amounts of data about the object being studied (this could be a bridge, a road network, a wind turbine, or even an underground farm), and then using the data to develop computer models for analysing and improving the object’s design and functioning (see ‘digital twin’).

  • Dataset

    A collection of numbers or words that can be analysed to obtain information. Datasets are often collected and stored in a tabular format, with each column corresponding to a different variable (e.g. height, weight, age) and each row corresponding to a different entry or ‘record’ (e.g. a different person). The data might come from real-life observations and measurements, or it can be generated artificially (see ‘synthetic data’).

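    A hedged sketch in Python of how such a tabular dataset might be represented and analysed; the records and values are invented for illustration:

```python
# A tiny tabular dataset: each row is a record, each key a variable.
# The names and values here are invented for illustration.
dataset = [
    {"name": "Ann",  "height_cm": 162, "age": 34},
    {"name": "Ben",  "height_cm": 175, "age": 29},
    {"name": "Cleo", "height_cm": 158, "age": 41},
]

# Analysing the data: the average age across all records.
average_age = sum(row["age"] for row in dataset) / len(dataset)
print(round(average_age, 2))  # → 34.67
```
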
  • Deep learning

    A form of machine learning that uses computational structures known as ‘neural networks’ to automatically recognise patterns in data and provide a suitable output, such as a prediction or evidence for a decision. Deep learning neural networks are loosely inspired by the way neurons in animal brains are organised, being composed of multiple layers of simple computational units (‘neurons’), and they are suited to complex learning tasks such as picking out features in images and speech. Deep learning thus forms the basis of the voice control in our phones and smart speakers, and enables driverless cars to identify pedestrians and stop signs. See also ‘neural network’.

  • Deepfake

    Synthetic audio, video or imagery in which someone is digitally altered so that they look, sound or act like someone else. Created by machine learning algorithms, deepfakes have raised concerns over their use in fake celebrity pornography, financial fraud, and the spreading of false political information. ‘Deepfake’ can also refer to realistic but completely synthetic media of people and objects that have never physically existed, or to sophisticated text generated by algorithms. See also ‘generative adversarial network’.

  • Digital twin

    A computer model that simulates an object in the real world, such as a jet engine, bridge, wind turbine, Formula One car, biological system, or even an entire city. Analysing the model’s output can tell researchers how the physical object will behave, helping them to improve its real-world design and/or functioning. Digital twins are a key tool in the field of data-centric engineering.

  • Encryption

    The process of encoding data for security or privacy reasons. An algorithm is used to convert the data (‘plaintext’) into an alternative form (‘ciphertext’) that can only be easily decrypted into its original form using a piece of data known as a ‘key’.

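    A toy illustration in Python of the plaintext/ciphertext/key idea, using a simple ‘Caesar’ letter shift. This is for illustration only: real-world encryption relies on far stronger algorithms.

```python
def shift(text, key):
    """Toy 'Caesar' cipher: shift each letter forward by `key` places.
    This only illustrates the plaintext/ciphertext/key idea -- it is
    trivially breakable and nothing like real encryption."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            out.append(ch)          # leave spaces and punctuation alone
    return "".join(out)

ciphertext = shift("hello", 3)      # encrypt the plaintext with key 3
print(ciphertext)                   # → "khoor"
print(shift(ciphertext, -3))        # decrypt with the key → "hello"
```
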
  • Generative adversarial network (GAN)

    A machine learning technique for generating data, such as realistic ‘deepfake’ images, that is difficult to distinguish from the data it was trained on. A GAN is made up of two competing elements: a generator and a discriminator. The generator creates fake data, which the discriminator compares with real ‘training’ data, reporting back where it has detected differences. Over time, the generator learns to create more realistic data, until the discriminator can no longer tell what is real and what is fake.
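    The adversarial loop can be caricatured without any neural networks at all. In this deliberately simplified Python sketch (all values invented), the ‘discriminator’ scores how realistic a number looks and the ‘generator’ adjusts its output in response; a real GAN replaces both with trained neural networks:

```python
import random

random.seed(0)
# 'Real' training data: 100 numbers clustered around 10.
real_data = [random.gauss(10.0, 1.0) for _ in range(100)]
real_mean = sum(real_data) / len(real_data)

def discriminator(x):
    # Scores a sample: the closer to the real data's average,
    # the more 'realistic' it looks. (A real discriminator is learned.)
    return -abs(x - real_mean)

# Generator: starts with a poor guess and repeatedly nudges it in
# whichever direction the discriminator finds more convincing.
guess, step = 0.0, 0.5
for _ in range(200):
    if discriminator(guess + step) > discriminator(guess):
        guess += step
    elif discriminator(guess - step) > discriminator(guess):
        guess -= step
    else:
        step /= 2   # both directions look worse: refine the search

print(round(guess, 2))  # ends up very close to the real data's average
```
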

  • Human-in-the-loop (HITL)

    A system comprising a human and an artificial intelligence component, in which the human can intervene in some significant way, e.g. by training, tuning or testing the system’s algorithm so that it produces more useful results. It is a way of combining human and machine intelligence, helping to make up for the shortcomings of both.

  • Machine learning (ML)

    A field of artificial intelligence involving computer algorithms that can ‘learn’ by finding patterns in sample data. The algorithms then typically apply these findings to new data to make predictions or provide other useful outputs, such as translating text or guiding a robot in a new setting. Medicine is one area of promise: machine learning algorithms can identify tumours in scans, for example, which doctors might have missed.

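    As a minimal illustration of ‘learning from sample data’, the Python sketch below fits a straight line to invented sample points by least squares, then uses the learned pattern to predict an output for a new input:

```python
# 'Learning' a pattern from sample data: fit a straight line y = a*x + b
# by least squares, then use it to predict on an unseen input.
# The sample points are invented for illustration (roughly y = 2x).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(round(a, 2), round(b, 2))   # learned slope and intercept
print(round(a * 10 + b, 1))       # prediction for the new input x = 10
```
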
  • Multi-agent system (MAS)

    A computer system involving multiple, interacting software programs known as ‘agents’. Agents often actively help and work with humans to complete a task – the most common everyday examples are virtual assistants such as Siri, Alexa and Cortana. In a multi-agent system, the agents talk directly to each other, typically in order to complete their tasks more efficiently. This could help in applications as diverse as multi-robot manufacturing, disaster response, and automatically coordinating meetings for multiple people.

  • Natural language processing (NLP)

    A field of artificial intelligence that uses computer algorithms to analyse or synthesise human speech and text. The algorithms look for linguistic patterns in how sentences and paragraphs are constructed, and how the words, context and structure work together to create meaning. Applications include speech-to-text converters, customer service chatbots, speech recognition, automatic translation, and sentiment analysis (identifying the mood of a piece of text).

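    A deliberately crude Python sketch of sentiment analysis, counting words from tiny invented word lists; real NLP systems learn such patterns from large bodies of text:

```python
# A very crude sentiment analyser: count positive and negative words.
# The word lists are tiny and invented purely for illustration.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "awful", "hate", "poor", "sad"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great phone"))        # → positive
print(sentiment("awful battery and poor screen"))  # → negative
```
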
  • Neural network

    An artificial intelligence system inspired by the biological brain, consisting of a large set of simple, interconnected computational units (‘neurons’), with data passing between them as between neurons in the brain. Neural networks can have hundreds of layers of these neurons, with each layer playing a role in solving the problem. They perform well in complex tasks such as face and voice recognition. See also ‘deep learning’.
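    The structure can be sketched in a few lines of Python. The weights below are hand-picked purely for illustration (chosen so the network approximates logical AND); real networks learn their weights from data:

```python
import math

def neuron(inputs, weights, bias):
    # A single 'neuron': weighted sum of its inputs, squashed
    # into the range (0, 1) by a sigmoid activation.
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))

def network(x1, x2):
    # Two-layer network: a hidden layer of two neurons feeding
    # one output neuron. Weights are hand-picked, not learned.
    h1 = neuron([x1, x2], [ 4.0,  4.0], -6.0)
    h2 = neuron([x1, x2], [-4.0, -4.0],  2.0)
    return neuron([h1, h2], [8.0, -2.0], -4.0)

# This particular choice of weights approximates logical AND:
print(round(network(1, 1)))  # → 1
print(round(network(0, 1)))  # → 0
```
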

  • Open source

    Software and data that are free to edit and share. This helps researchers to collaborate, as they can edit the resource to suit their needs and add new features that others in the community can benefit from. Open source resources save researchers time (as the resources don’t have to be built from scratch), and they are often more stable and secure than non-open alternatives because users can more quickly fix bugs that have been flagged up by the community. By allowing data and tools to be shared, open source projects also play an important role in enabling researchers to check and replicate findings.

  • Robot

    A machine that is capable of automatically carrying out a series of actions. The word ‘robot’ was introduced by Czech writer Karel Čapek in his 1920 sci-fi play R.U.R. (Rossum’s Universal Robots), but the idea of self-operating machines goes back to antiquity. Modern robots typically contain programmed computers and exhibit some form of artificial intelligence. They can include ‘humanoids’ that look and move like humans, industrial robots used in manufacturing, medical robots for performing surgery, and self-navigating drones.

  • Synthetic data

    Data that is generated artificially, rather than by real-world events. It is especially useful for research in areas where privacy is key, such as healthcare and finance, as the generated data can retain the original data’s statistical properties, but with any identifying information removed. Synthetic data can also be used to augment a dataset with additional data points, often to help an artificial intelligence system to learn some desirable property; or to train algorithms in situations where it is dangerous to get hold of the real data, such as teaching a self-driving car how to deal with pedestrians in the road.

  • Turing machine

    A hypothetical computer first conceptualised by Alan Turing in 1936, which is capable of running any algorithm, no matter how complicated. It consists of an infinitely long tape divided into ‘cells’, and a read/write head that can change the contents of each cell according to a pre-defined set of rules (an algorithm). The Turing machine is rich enough to be considered an abstract description of the vast majority of computers, and so forms the basis of much of modern computer science.

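    The idea can be sketched as a short Python simulator; the example rule table (invented for illustration) flips every binary digit on the tape and then halts:

```python
# A minimal Turing machine simulator: a tape of cells, a read/write
# head, and a rule table mapping (state, symbol read) to
# (next state, symbol to write, head move).
def run(tape, rules, state="start"):
    tape = dict(enumerate(tape))          # the (conceptually infinite) tape
    head = 0
    while state != "halt":
        symbol = tape.get(head, " ")      # unvisited cells read as blank
        state, write, move = rules[(state, symbol)]
        tape[head] = write                # write to the current cell
        head += 1 if move == "R" else -1  # move the head one cell
    return "".join(tape[i] for i in sorted(tape)).strip()

# Example machine: flip every binary digit, halt at the first blank.
flip_bits = {
    ("start", "0"): ("start", "1", "R"),
    ("start", "1"): ("start", "0", "R"),
    ("start", " "): ("halt",  " ", "R"),
}

print(run("1011", flip_bits))  # → "0100"
```
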
  • Turing test

    A test of a machine’s ability to demonstrate human-like intelligence. First introduced by Alan Turing as the “imitation game” in his 1950 paper “Computing Machinery and Intelligence”, the test involves a human evaluator asking questions to another human and a machine via a computer keyboard and monitor. If the evaluator cannot tell from the written responses which is the human and which is the machine, then the machine has passed the Turing test.