The Institute of Biomedical and Clinical Sciences has played a leading role in identifying novel disease genes from next generation sequencing data. The pipeline used to identify disease-causing mutations has developed extensively over the past 10 years, but current workflows remain heavily biased towards variants that are predicted to alter the amino acid sequence of a protein or that affect well-described canonical splice sites. Variants elsewhere in a gene can also affect splicing of the pre-mRNA. This project aims to develop a pipeline that uses machine learning to identify, from RNA sequencing data, the sequence motifs that direct splicing, and to create a robust scoring system for prioritising variants. This will enable the comparison of splicing in different tissues and allow us to score variants in sequence datasets, highlighting those that may affect splicing for further analysis.
Explaining the science
To produce a functional protein, a gene is first transcribed into RNA, which is then spliced to remove the non-coding regions (introns). This mechanism of mRNA maturation and splicing is not fully understood, and defects in these processes have been implicated in various inherited disorders. Because splicing is complex, it is difficult to identify DNA variants that may affect it using current bioinformatic methods, so more in-depth analysis methods are needed.
The aim of this project was to develop a pipeline, based on the machine learning tool SpliceAI, that could score and highlight variants in sequence datasets that could potentially cause disease by affecting splicing. SpliceAI analyses large datasets of RNA sequences and compares them with the human genome DNA sequence in order to learn the patterns and motifs that cells use to splice out non-coding sequence. Using this information, the software looks for these patterns and motifs in the sequence data of patients and determines whether any variants present in their sequence might affect this process. Any highlighted variants can then be investigated further in the lab.
The first aim is to develop a pipeline that can identify, from RNA-seq data, the genetic sequences that result in splicing. The second aim is to develop a robust scoring system that allows variants to be prioritised by their predicted effect on splicing.
The overall objective is to develop a script that could be incorporated into a standard pipeline to highlight, for further investigation, variants that may affect splicing and would otherwise have been missed. This is important because the causative mutations in many patients with a rare disease are not identified using standard sequence analysis pipelines. There is therefore an urgent clinical need to develop new tools to assist in the variant interpretation process.
This work is of particular importance to the field of medical genetics, where it can assist in the identification of disease-causing mutations, for example in the NHS Genomic Medicine Centres, which perform diagnostic sequencing for patients.
SpliceAI, a machine learning-based tool that identifies splice variants in genomic data, was utilised to calculate the probability of each variant in a pre-mRNA transcript affecting splicing, by introducing or abolishing splice donor or splice acceptor sites. To assess the accuracy of SpliceAI, known splice-altering variants were investigated, in a blind test, alongside an equal number of non-splice-altering variants.
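A blind test of this kind can be summarised as sensitivity and specificity at a chosen delta score cut-off. The following is a minimal Python sketch under stated assumptions, not the code used in the project: the function name, the 0.5 cut-off and the toy scores are all hypothetical and chosen only for illustration (the SpliceAI documentation discusses 0.2, 0.5 and 0.8 as possible cut-offs).

```python
def evaluate(scored_variants, threshold=0.5):
    """Return (sensitivity, specificity) for labelled variants.

    scored_variants: iterable of (max_delta_score, is_splice_altering) pairs.
    threshold: assumed cut-off; a variant is called splice-altering
               when its score is greater than or equal to this value.
    """
    tp = fn = tn = fp = 0
    for score, is_altering in scored_variants:
        called = score >= threshold
        if is_altering and called:
            tp += 1
        elif is_altering and not called:
            fn += 1
        elif not is_altering and called:
            fp += 1
        else:
            tn += 1
    # Guard against empty classes rather than dividing by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy data: three known splice-altering variants and
# three non-splice-altering controls (equal numbers, as in the blind test).
toy = [
    (0.91, True), (0.62, True), (0.08, True),
    (0.03, False), (0.00, False), (0.55, False),
]

if __name__ == "__main__":
    sens, spec = evaluate(toy, threshold=0.5)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Repeating the calculation at several cut-offs shows how the balance between missed splice-altering variants and false positives shifts as the threshold moves.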
An R script was written to process the raw SpliceAI results and present the 10 variants with the highest delta scores (the predicted probability that a variant is splice-altering), alongside accompanying information such as the gene each variant lies within and the position of the affected splice site relative to the variant. SpliceAI accurately identified variants with splice-altering properties: high delta scores were generated for the known splice-altering variants but not for variants with no role in splice alteration, validating the use of this program for the project.
A number of whole exome datasets, in which no disease-causing variants could be identified, were then analysed using the same pipeline. This revealed a number of potential variants that may be contributing to disease and that are now being investigated in the laboratory.
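The ranking step can be sketched as follows. This is a Python illustration, not the R script written for the project. It assumes the input is a SpliceAI-annotated VCF whose INFO field carries the documented annotation `SpliceAI=ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL`. The function names, the output column names and the example records (including the gene names GENE_A and GENE_B) are hypothetical.

```python
FIELDS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
          "DP_AG", "DP_AL", "DP_DG", "DP_DL"]
# Delta scores: acceptor gain/loss and donor gain/loss.
DS_KEYS = ["DS_AG", "DS_AL", "DS_DG", "DS_DL"]


def parse_spliceai_info(info):
    """Yield one dict per SpliceAI annotation found in a VCF INFO string."""
    for entry in info.split(";"):
        if not entry.startswith("SpliceAI="):
            continue
        for annotation in entry[len("SpliceAI="):].split(","):
            values = annotation.split("|")
            if len(values) == len(FIELDS):
                yield dict(zip(FIELDS, values))


def top_variants(vcf_lines, n=10):
    """Return the n annotations with the highest maximum delta score."""
    rows = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt = fields[:5]
        for ann in parse_spliceai_info(fields[7]):
            try:
                scores = {key: float(ann[key]) for key in DS_KEYS}
                best = max(scores, key=scores.get)
                # The matching DP_* field is the offset of the affected
                # site from the variant (negative taken as upstream).
                offset = int(ann["DP_" + best[3:]])
            except ValueError:
                continue  # missing values such as "."
            rows.append({
                "chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
                "gene": ann["SYMBOL"], "delta_score": scores[best],
                "effect": best, "offset": offset,
                "splice_position": int(pos) + offset,
            })
    rows.sort(key=lambda row: row["delta_score"], reverse=True)
    return rows[:n]


# Hypothetical records, for illustration only.
example_lines = [
    "##fileformat=VCFv4.2",
    "1\t1000\t.\tC\tT\t.\t.\tSpliceAI=T|GENE_A|0.00|0.00|0.91|0.08|-28|-46|-2|-31",
    "2\t2000\t.\tG\tA\t.\t.\tDP=10;SpliceAI=A|GENE_B|0.10|0.02|0.00|0.00|5|-3|12|40",
]

if __name__ == "__main__":
    for row in top_variants(example_lines):
        print(row["gene"], row["delta_score"], row["effect"], row["offset"])
```

Taking the maximum of the four delta scores as the single ranking value is a design choice made here for simplicity; a pipeline could equally report all four scores per variant.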