Huge amounts of complex data from patients and their genomes remain unlinked, even though they could in principle be complemented and enriched by information gained from fundamental biomedical research. Formally integrating all this knowledge would make it possible to exploit multi-dimensional data, including, for example, data derived from biochemical experiments or studies on genetically manipulated model organisms. Such linked data offers the possibility of using machine learning to gain new insights into the relationship between genetics and disease, aiding diagnosis, prognosis and the development of new therapeutic approaches.

Explaining the science

Clinical observations coming from precision medicine are now being supplemented by new streams of multi-dimensional background data from, for example, the millions of human genome sequences now available, descriptions of non-human animal models of disease, high-throughput biochemical and molecular experiments, and studies of drug interactions and side effects. Integrating patient data with this broader biomedical knowledge could provide new insights, with unexpected links emerging from the deep knowledge that underlies clinical measurements, for example protein functions or interactions.

One of the most significant problems in mobilising and exploiting background knowledge is the meaningful integration of very large and highly complex data, often held in isolated and unrelated databases. At the root of this is the problem of semantics (the differing terminologies used by different databases), often referred to as the Babel problem. The project will address this at two levels. First, clinical and genomic data from patient electronic health records will be extracted and formally represented using standard 'ontologies'. Ontologies are structured terminologies for specific areas of knowledge, for example clinical signs and symptoms (phenotypes), in which the terms are connected by formal relationships that can be understood by computers. Second, the project will align existing terminologies to bridge the gap between data from diverse sources annotated with different ontologies or structured in different ways, rather in the way that the Rosetta Stone allowed the mapping of concepts and words across three scripts.
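The two levels described above can be illustrated with a minimal Python sketch. It assumes two hypothetical source vocabularies and a tiny invented ontology; every identifier beginning with `EX:` is a placeholder made up for illustration, not a term from any real ontology, and the function names are likewise assumptions rather than the project's actual tooling.

```python
# Level 1: free-text terms from two hypothetical sources are mapped to
# shared ontology identifiers. All EX: identifiers are invented placeholders.
SOURCE_A_TO_ONTOLOGY = {
    "underactive thyroid": "EX:0001",
    "short stature": "EX:0002",
}
SOURCE_B_TO_ONTOLOGY = {
    "hypothyroidism": "EX:0001",
    "growth delay": "EX:0003",
}

# Formal "is_a" relationships between terms (child -> parent).
IS_A = {
    "EX:0001": "EX:0010",  # placeholder for a thyroid abnormality class
    "EX:0002": "EX:0020",  # placeholder for a growth abnormality class
    "EX:0003": "EX:0020",
}


def normalise(term, mapping):
    """Return the ontology identifier for a source term, or None if unmapped."""
    return mapping.get(term.strip().lower())


def ancestors(term_id):
    """Collect every ancestor of a term by following is_a links upwards."""
    found = set()
    while term_id in IS_A:
        term_id = IS_A[term_id]
        found.add(term_id)
    return found


def related(id_a, id_b):
    """Level 2: two terms are related if they are equal or share an ancestor."""
    lineage_a = ancestors(id_a) | {id_a}
    lineage_b = ancestors(id_b) | {id_b}
    return bool(lineage_a & lineage_b)


a = normalise("Underactive thyroid", SOURCE_A_TO_ONTOLOGY)
b = normalise("hypothyroidism", SOURCE_B_TO_ONTOLOGY)
print(a == b)                         # same concept, different wording
print(related("EX:0002", "EX:0003"))  # different terms, shared parent class
```

Once both sources point at the same identifiers, records that used different wording become directly comparable, and the is_a hierarchy lets a computer recognise that two distinct terms fall under a common class.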

Once compatibility is achieved, the door is opened to integration with the huge volume of human genotype/phenotype data now available, together with data from fundamental experimental sciences scattered internationally across multiple public databases. The data can be linked into a very large network, or knowledge graph. This may then be exploited using machine learning approaches to discover, for example, previously unsuspected links between patient groups (patient stratification), insights into genetic overlaps between seemingly unrelated diseases and traits, and new predictors of prognosis, clinical events and responsiveness to therapy.
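As a rough illustration of how a knowledge graph can expose links that are not visible in the patient records alone, the following sketch stores facts as subject, predicate, object triples and asks which genes two patients can both be traced to. The node names, predicates and helper functions are all hypothetical, and a simple graph traversal stands in here for the machine learning methods the project envisages.

```python
from collections import defaultdict

# Invented example facts; none of these nodes refer to real patients or genes.
TRIPLES = [
    ("patient:1", "has_phenotype", "phenotype:A"),
    ("patient:2", "has_phenotype", "phenotype:B"),
    ("patient:3", "has_phenotype", "phenotype:C"),
    ("phenotype:A", "associated_with", "gene:X"),
    ("phenotype:B", "associated_with", "gene:X"),
    ("phenotype:C", "associated_with", "gene:Z"),
    ("gene:X", "interacts_with", "gene:Y"),
]


def build_graph(triples):
    """Index the triples as a directed adjacency map (subject -> objects)."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
    return graph


def reachable(graph, start):
    """Return every node that can be reached from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def shared_genes(graph, node_a, node_b):
    """Genes reachable from both nodes: a candidate mechanistic link."""
    common = reachable(graph, node_a) & reachable(graph, node_b)
    return {node for node in common if node.startswith("gene:")}


graph = build_graph(TRIPLES)
print(sorted(shared_genes(graph, "patient:1", "patient:2")))
print(sorted(shared_genes(graph, "patient:1", "patient:3")))
```

In this toy graph the first two patients have no phenotype in common, yet both are connected to the same gene through the background knowledge, which is the kind of link the integrated network is meant to surface.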

Project aims

Routinely collected clinical records allow investigation of many different diseases simultaneously, including inherently 'longitudinal' phenotypes related to multi-morbidity, complications, progression, drug response, and survival. A systems genomics approach coupled with electronic health records (EHRs) offers the tantalising possibility of uncovering shared aetiology (causes of disease) of many apparently different disorders, identifying potential synergies in the prevention and treatment of multiple disorders, and reducing polypharmacy (concurrent use of multiple medications) and associated treatment-related complications. This data can be linked to underlying information from fundamental scientific investigation, available through public databases, in a way that will allow mechanistic bridges to be made between phenotypic observations, genotypes and pharmacology. 

The aim of the project is to tie all of this data to an underlying biomedical knowledgebase through formal relations, effectively establishing a description logic knowledge graph that combines information from patient records with comprehensive background biomedical knowledge. The resulting platform will permit the application of machine learning approaches to discover new and meaningful insights into the biology of disease, its diagnosis and treatment.


The resulting resource will be useful across a wide range of diseases and clinical applications. For example, it can be used to discover sub-groups of patients within one disease (patient stratification) whose disease variants have different underlying genetics and respond differently to treatment. It may also be used to identify underlying mechanisms of disease, with the potential to support the development of new therapeutic approaches.
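Patient stratification can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: patients are represented by sets of placeholder phenotype identifiers (the `EX:` codes are invented), similarity is measured with the Jaccard index, and patients are grouped by a simple threshold rule. This is a deliberately basic stand-in and not the machine learning method the project will use.

```python
def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def stratify(patients, threshold=0.5):
    """Group patients so that each one joins any group containing a
    sufficiently similar member (a simple single-linkage rule)."""
    groups = []
    for pid in sorted(patients):
        matching = [
            group for group in groups
            if any(jaccard(patients[pid], patients[m]) >= threshold for m in group)
        ]
        merged = {pid}
        for group in matching:
            merged |= group
            groups.remove(group)
        groups.append(merged)
    return groups


# Invented phenotype annotations for four hypothetical patients.
PATIENTS = {
    "p1": {"EX:0001", "EX:0002"},
    "p2": {"EX:0001", "EX:0002", "EX:0003"},
    "p3": {"EX:0004", "EX:0005"},
    "p4": {"EX:0004"},
}

for group in stratify(PATIENTS):
    print(sorted(group))
```

With the default threshold, the four hypothetical patients fall into two sub-groups. In practice the similarity measure would also draw on the ontology hierarchy and the linked background knowledge rather than on exact term overlap alone.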

As a demonstration of the approach, the knowledgebase will initially be used to gain insights into the genetic basis of congenital hypothyroidism in patients identified by the Deciphering Developmental Disorders project who also show a wide range of other phenotypes, with the aim of identifying novel genetic variants or combinations of variants that might be responsible for their disease.