Monitoring a single data source for malicious activity is unlikely to be effective in a large-scale application. This project aims to improve existing statistical methodology for anomaly detection with multiple graphs (symbolic representations of networks), and will principally investigate solutions based on 'joint spectral embedding'. Through the Turing, the project will work with industry and government partners to see prototype algorithms and software deployed.
Explaining the science
Spectral embedding is a procedure that represents the nodes (e.g. individuals) of a network (e.g. friendships) as points in space, allowing a much more straightforward analysis. For example, a classical clustering algorithm (e.g. k-means or Gaussian mixture modelling) can be applied to the points to recover communities, the points can be used within a classifier to predict node characteristics based on connectivity patterns, or finally they can be used to predict future connections and detect anomalies.
A body of statistical research now allows each of these tasks to be performed precisely, with formal quantification of uncertainty. However, many cyber-security problems often involve multiple networks, and here much less is known. This project will develop theory and methodology surrounding joint spectral embedding, i.e. mapping the nodes from different graphs into the same space, and exploit this technique to reveal malicious behaviours that cannot be detected from a single network data source.
The last decade has seen important statistical progress on the analysis of graphs. Principled and scalable inference is now possible, for example, using spectral embedding. However, results are largely restricted to the analysis of a single, simple graph. This limitation is problematic in several cyber-security applications where a signal of interest, such as the presence of an intruder, is often hidden across multiple network data sources. For example, signs of an advanced persistent threat could be found in corporate email and web logs, host-based data, network flow, authentication data and more.
This project aims to set out a principled and tractable framework, at the intersection of mathematics, statistics and computer science, to allow more effective use of multi-modal cyber-security data. A first success would be to extend current spectral embedding methodology to allow more effective sharing of information between two graphs. This might allow an analyst, for example, to overlay two networks and pick out significant behavioural discrepancies to reveal evidence of an intrusion.
More generally, if recent statistical results for simple graphs can be reproduced in more complex scenarios involving multiple graphs, directionality and dynamics, then substantial improvements in threat detection can be expected. Of course, such developments would benefit society much more widely, since complex network data now arise in virtually every branch of academia, government and industry.
The project will involve working with industry and government partners to see prototype algorithms and software deployed on cyber-security applications relevant to national security. The research should generate methods of analysis that could form part of a cyber-security practitioner's standard toolkit, and in this respect any company with a security operations centre could benefit. Inference with multiple graphs also has important applications in genetics, biology, medicine and recommender systems.