Introduction
Machine learning has the potential to accelerate the process of identifying and designing novel materials and molecules, which could lead to new types of solar cells, batteries, pharmaceuticals, and more. Many central problems in materials design can be framed as optimisation problems: searching for a molecule with a given set of desired properties. There are two major obstacles. First, molecular space is enormous: for example, the number of potential drug-like molecules has been estimated to be on the order of 10^23 to 10^60. Second, small changes in the structure of a molecule can cause large changes in its properties.
Explaining the science
The first step in designing machine learning models for molecules is to decide on a choice of representation. One easy place to start is to describe a molecule as text, in a formal language like the SMILES language. For example, in this language, a molecule of caffeine would be written as “CN1C=NC2=C1C(=O)N(C(=O)N2C)C”. The letters correspond to different types of atoms, and the other symbols describe different types of chemical bonds, as well as which atoms bond to which.
This representation is much richer than a simple molecular formula C8H10N4O2. An alternative to this text-based representation is a graph-based representation, in which each atom in the molecule is a node in the graph, and the adjacency matrix is defined based on the bonds. A nice thing about representing molecules as text or as graphs is that these are data types we are already used to working with: many existing machine learning models for natural language, or for network data, can be adapted to work on molecules.
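As a rough illustration of how the text and graph views relate, the sketch below turns the caffeine SMILES string into a list of atoms and an adjacency matrix whose entries are bond orders. It is a minimal sketch, not a real SMILES parser: it assumes only the symbols that appear in the caffeine example (the atoms C, N and O, the bond symbols, parentheses for branches and single-digit ring closures), and the names `smiles_to_graph`, `molecular_formula` and `TYPICAL_VALENCE` are hypothetical. Hydrogens are inferred by assuming typical valences for neutral atoms, which is a simplification.

```python
from collections import Counter

# Assumed typical valences for neutral atoms; used only to infer implicit hydrogens.
TYPICAL_VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def smiles_to_graph(smiles):
    """Parse a restricted SMILES subset into (atoms, adjacency matrix)."""
    atoms = []          # element symbol of each node
    bonds = []          # (i, j, order) edges
    branch_stack = []   # atoms to return to when ")" closes a branch
    open_rings = {}     # ring-closure digit -> (atom that opened it, bond order)
    previous = None     # most recent atom on the current chain
    pending_order = 1   # order of the next bond; single unless a symbol says otherwise

    for symbol in smiles:
        if symbol in TYPICAL_VALENCE:
            atoms.append(symbol)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current, pending_order))
            previous, pending_order = current, 1
        elif symbol in BOND_ORDER:
            pending_order = BOND_ORDER[symbol]
        elif symbol == "(":
            branch_stack.append(previous)
        elif symbol == ")":
            previous = branch_stack.pop()
        elif symbol.isdigit():
            if symbol in open_rings:
                partner, opening_order = open_rings.pop(symbol)
                bonds.append((partner, previous, max(opening_order, pending_order)))
            else:
                open_rings[symbol] = (previous, pending_order)
            pending_order = 1
        else:
            raise ValueError(f"unsupported SMILES symbol: {symbol!r}")

    size = len(atoms)
    adjacency = [[0] * size for _ in range(size)]
    for i, j, order in bonds:
        adjacency[i][j] = adjacency[j][i] = order
    return atoms, adjacency


def molecular_formula(atoms, adjacency):
    """Collapse the graph to a formula, filling unused valence with hydrogens."""
    counts = Counter(atoms)
    counts["H"] = sum(
        TYPICAL_VALENCE[element] - sum(row) for element, row in zip(atoms, adjacency)
    )
    ordered = ["C", "H"] + sorted(e for e in counts if e not in ("C", "H"))
    return "".join(
        f"{e}{counts[e] if counts[e] > 1 else ''}" for e in ordered if counts[e] > 0
    )


atoms, adjacency = smiles_to_graph("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
print(molecular_formula(atoms, adjacency))  # C8H10N4O2
```

The graph keeps which atoms are bonded to which, whereas the formula recovered at the end discards that information. Full SMILES has many features this sketch ignores, such as aromatic lowercase atoms, bracketed atoms and charges, so an established cheminformatics toolkit would be the natural choice in practice.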
However, when designing algorithms which will propose novel molecules, it’s important to remember that not every SMILES string and not every molecular graph is chemically feasible. Some rules are easy to write down as constraints (such as restrictions on how many bonds a particular atom may make at a time), while others, such as toxicity or stability, are themselves tricky properties to evaluate.
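The bond-count restriction is one of the rules that is easy to write down. The sketch below checks it on a molecular graph given as a list of atoms and a list of bonds. The function name `valence_violations` and the `MAX_VALENCE` table are hypothetical, and the table assumes neutral atoms with their usual maximum valences; charged species, which can exceed these limits, are outside what this sketch handles.

```python
# Assumed maximum number of bonds (counting bond order) for neutral atoms.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def valence_violations(atoms, bonds):
    """Return (index, element, bonds_used) for every atom exceeding its limit.

    atoms: list of element symbols, one per node.
    bonds: list of (i, j, order) tuples connecting atom indices i and j.
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [
        (index, element, total)
        for index, (element, total) in enumerate(zip(atoms, used))
        if total > MAX_VALENCE[element]
    ]


# Heavy-atom skeleton of formaldehyde (C=O): within the limits.
feasible = valence_violations(["C", "O"], [(0, 1, 2)])

# A neutral oxygen bonded to three carbons: exceeds the limit of two.
infeasible = valence_violations(
    ["C", "O", "C", "C"], [(0, 1, 1), (1, 2, 1), (1, 3, 1)]
)

print(feasible)    # []
print(infeasible)  # [(1, 'O', 3)]
```

A check like this can reject a proposed graph cheaply. Properties such as toxicity or stability have no comparably simple rule.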
Talks
José Miguel Hernández-Lobato presented a tutorial on machine learning for molecules at a machine learning summer school in Madrid in September 2018. The talks listed here provide background on related methods, as well as an overview of recent work done by members of this group:
Brooks Paige at Microsoft Research Cambridge
This group has co-organised two workshops at the NeurIPS conference, in 2017 and 2018, and is organising another workshop at NeurIPS 2020 this year.
Aims
The main aim of this group is to design models and algorithms that can translate into real-world impact by enabling faster and more efficient approaches to molecular design. Our goal is to establish a focal point at The Alan Turing Institute for investigating machine learning and chemical modelling, and to support cross-disciplinary collaboration between researchers with different expertise and backgrounds who are interested in this area.
Talking points
Molecular data is expensive and time-consuming to collect
Challenges: Computational approaches are slow, and may be incorrect; lab testing is expensive and even slower
Example output: Data-efficient algorithms based on active learning or on Bayesian optimisation
Models need to be interpretable if they are going to see adoption
Challenges: A machine learning model (even one with high accuracy) will not be used if it fails in ways which are chemically or physically implausible, eroding trust in the system
Example output: Models which explicitly encode chemical and physical constraints
Machine learning models can suggest a molecule, but not how to create it
Challenges: Even if a machine learning model proposes a candidate molecule, eventually it will need to be synthesised in a lab
Example output: Models for proposing candidate molecules which explicitly propose synthetic routes
Small changes in molecular structure can cause large changes in properties
Challenges: Optimising molecules – and even predicting properties – is difficult, because chemical space is not smooth
Example output: Learning continuous embeddings of molecules which respect notions of molecular similarity
How to get involved
Organisers
Researchers
John Bradshaw
MPI for Intelligent Systems at Tübingen, PhD student at University of Cambridge
Marwin Segler
Microsoft Research, Cambridge
Contact info
Brooks Paige, [email protected]