Machine learning has the potential to accelerate the process of identifying and designing novel materials and molecules, which could lead to new types of solar cells, batteries, pharmaceuticals, and more. Many central problems in materials design can be framed as an optimisation problem: searching for a molecule with a given set of desired properties. There are two major obstacles: first, molecular space is enormous – for example, the number of potential drug-like molecules has been estimated to be on the order of 10^23 to 10^60 – and second, small changes in the structure of a molecule can lead to large changes in its properties.

Explaining the science

The first step in designing machine learning models for molecules is to choose a representation. One easy place to start is to describe a molecule as text, in a formal language such as SMILES. For example, in this language, a molecule of caffeine would be written as “CN1C=NC2=C1C(=O)N(C(=O)N2C)C”. The letters correspond to different types of atoms, and the other symbols describe different types of chemical bonds, as well as which atoms bond to which.
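
To make the "formal language" point concrete, here is a minimal sketch of tokenising that caffeine string into its atom, bond, ring-closure and branch symbols. The regex covers only a small subset of SMILES (a few organic-subset elements, digits, and common bond/branch characters); a real parser handles far more.

```python
import re

# Simplified token pattern: two-letter elements first, then single-letter
# atoms, ring-closure digits, and bond/branch symbols. This is a sketch,
# not a complete SMILES grammar.
TOKEN_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[cnops]|\d|[=#\-+()/\\\[\]@]")

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom, bond, ring and branch tokens."""
    return TOKEN_PATTERN.findall(smiles)

caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
tokens = tokenize_smiles(caffeine)
# Filtering to atom tokens recovers the heavy-atom counts of C8H10N4O2
# (hydrogens are left implicit in SMILES):
atoms = [t for t in tokens if t in {"C", "N", "O"}]
```

Counting the atom tokens gives 8 carbons, 4 nitrogens and 2 oxygens, matching the heavy atoms in the molecular formula.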

This representation is much richer than the simple molecular formula, C8H10N4O2. An alternative to this text-based representation is a graph-based representation, in which each atom in the molecule is a node in the graph, and the adjacency matrix is defined based on the bonds. A nice thing about representing molecules as text or as graphs is that these are data types we are already used to working with: many existing machine learning models for natural language, or for network data, can be adapted to work on molecules.
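
The graph representation can be sketched in a few lines. Here we hand-build ethanol (SMILES "CCO") over its heavy atoms, ignoring the implicit hydrogens; the element list and bond list are the only inputs.

```python
# Each heavy atom is a node labelled by its element; the adjacency matrix
# records which atoms are bonded. Ethanol, hydrogens left implicit.
atoms = ["C", "C", "O"]          # node labels
bonds = [(0, 1), (1, 2)]         # C-C and C-O single bonds

n = len(atoms)
adjacency = [[0] * n for _ in range(n)]
for i, j in bonds:
    adjacency[i][j] = adjacency[j][i] = 1  # bonds are undirected

# Graph quantities like node degree fall out of the matrix directly:
degrees = [sum(row) for row in adjacency]  # [1, 2, 1]
```

Models for network data then operate on `atoms` and `adjacency` just as they would on any other labelled graph.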

However, when designing algorithms which will propose novel molecules, it’s important to remember that not every SMILES string and not every molecular graph is chemically feasible. Some rules are easy to write down as constraints – such as restrictions on how many bonds a particular atom may make at a time – while others, such as toxicity or stability, are themselves tricky properties to evaluate.
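
One of the easy-to-state constraints – that no atom makes more bonds than its valence allows – can be checked mechanically. The maximum valences below are simplified defaults for illustration, not a full chemistry model.

```python
# Simplified typical maximum valences; real chemistry has exceptions.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def respects_valence(atoms, bonds):
    """Return True if no atom is bonded more times than its valence allows.
    `bonds` is a list of (i, j, order) tuples over atom indices."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return all(used[k] <= MAX_VALENCE[a] for k, a in enumerate(atoms))

# Carbon dioxide (O=C=O) passes; an oxygen with three single bonds fails.
ok = respects_valence(["O", "C", "O"], [(0, 1, 2), (1, 2, 2)])       # True
bad = respects_valence(["O", "C", "C", "C"],
                       [(0, 1, 1), (0, 2, 1), (0, 3, 1)])            # False
```

Constraints like toxicity or stability admit no such simple check, which is exactly what makes them hard to build into generative models.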


José Miguel Hernández-Lobato presented a tutorial on machine learning for molecules at a machine learning summer school in Madrid in September 2018. These tutorials provide background on related methods, as well as an overview of recent work by members of this group:

Brooks Paige at Microsoft Research Cambridge

Matt Kusner at ICML Sydney

This group has co-organised two workshops at the NeurIPS conference, in 2017 and 2018, and is also organising another workshop at NeurIPS 2020.


The main aim of this group is to design models and algorithms that can translate into real-world impact by enabling faster and more efficient approaches to molecular design. Our goal is to establish a focal point at The Alan Turing Institute for investigating machine learning and chemical modelling, and to support cross-disciplinary collaboration between researchers with different expertise and backgrounds interested in this area.

Talking points

Molecular data is expensive and time-consuming to collect

Challenges: Computational approaches are slow, and may be incorrect; lab testing is expensive and even slower

Example output: Data-efficient algorithms based on active learning or on Bayesian optimisation
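
The heart of a Bayesian optimisation loop is the acquisition step: given a surrogate model's predictive mean and standard deviation for each untested candidate, an acquisition function such as expected improvement trades off exploiting high predicted values against exploring uncertain ones. The candidate names and surrogate predictions below are made up for illustration.

```python
import math

def expected_improvement(mean, std, best_so_far):
    """Expected improvement for maximisation, assuming a Gaussian
    predictive distribution at the candidate point."""
    if std == 0.0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    return (mean - best_so_far) * cdf + std * pdf

# Hypothetical surrogate predictions (mean, std) for three candidates:
candidates = {"mol_a": (0.70, 0.05),   # confident but mediocre
              "mol_b": (0.65, 0.30),   # uncertain, could be great
              "mol_c": (0.72, 0.01)}   # barely better than the best
best = 0.71  # best property value measured so far

scores = {name: expected_improvement(m, s, best)
          for name, (m, s) in candidates.items()}
next_to_test = max(scores, key=scores.get)  # "mol_b": exploration wins here
```

Each lab measurement then updates the surrogate, so the algorithm spends its expensive experiments only where they are most informative.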

Models need to be interpretable if they are going to see adoption

Challenges: A machine learning model (even one with high accuracy) will not be used if it fails in ways which are chemically or physically implausible, eroding trust in the system

Example output: Models which explicitly encode chemical and physical constraints

Machine learning models can suggest a molecule, but not how to create it

Challenges: Even if a machine learning model proposes a candidate molecule, eventually it will need to be synthesised in a lab

Example output: Models for proposing candidate molecules which explicitly propose synthetic routes

Small changes in molecular structure can have large changes in properties

Challenges: Optimising molecules – and even predicting properties – is difficult, because chemical space is not smooth

Example output: Learning continuous embeddings of molecules which respect notions of molecular similarity
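
One standard notion of molecular similarity that a learned continuous embedding might be asked to respect is Tanimoto similarity between binary fingerprints – the sets of substructure features each molecule contains. The feature sets below are made up for illustration.

```python
def tanimoto(features_a: set, features_b: set) -> float:
    """Intersection over union of two molecules' substructure features."""
    if not features_a and not features_b:
        return 1.0
    shared = len(features_a & features_b)
    return shared / (len(features_a) + len(features_b) - shared)

# Hypothetical substructure feature sets for three molecules:
mol_x = {"C-C", "C-O", "O-H"}
mol_y = {"C-C", "C-O", "C=O"}
mol_z = {"C-N", "N-H"}

sim_xy = tanimoto(mol_x, mol_y)  # 2 shared / 4 total = 0.5
sim_xz = tanimoto(mol_x, mol_z)  # no shared features = 0.0
```

An embedding "respects" this similarity if molecules with high Tanimoto similarity land close together in the continuous space, making gradient-based optimisation over that space meaningful.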

How to get involved

Click here to request sign-up and join



John Bradshaw

MPI for Intelligent Systems, Tübingen; PhD student at the University of Cambridge

Contact info

Brooks Paige, [email protected]