Introduction

Recent proposals for securing systems against sophisticated attackers, such as Advanced Persistent Threats (APTs), include wholesale monitoring of system activity down to the level of individual system calls, sometimes in a causal, graph-based representation called provenance.  While this level of detail should ensure that malicious activity is recorded, and that the sources and effects of such activity can be understood after the fact, it is challenging to locate such activity against a background of high-volume, high-velocity benign activity, with high variability in the structure and meaning of data obtained from different sources.  Automatic, reliable detection of realistic APT behavior in provenance traces (with an acceptable false-positive rate) appears to be an open problem.

Until recently, progress in this area has been hindered by the absence of publicly available datasets.  In the US, a recently concluded DARPA research programme on Transparent Computing has produced publicly available datasets including realistic APT behavior in provenance traces recorded on a variety of mainstream operating systems.  However, these datasets are not easy for the broader research community to reuse, due to their large scale (each day of activity can require more than a gigabyte) and heterogeneity.  Moreover, the data are highly imbalanced (attacks often make up under 0.01% of the data), and ground truth information that could be used for training is usually not available, rendering supervised machine learning techniques ineffective.  Relevant techniques for semi-supervised or unsupervised machine learning or outlier/anomaly detection may require adaptation to deal with the large scale of the data or its complex structure.

About the event

This workshop aims to bring together researchers with expertise in security, data management, and machine learning, each of which bears on this challenge.  The workshop will also involve participants in the DARPA Transparent Computing programme who can share experience and understanding of the problem and the available datasets.  Breakout sessions will be organised to enable participants to contribute to a research vision and agenda for future work in this area, which will form the basis of a workshop report.

This event is suitable for the following groups working in data science and AI:

  • Security researchers interested in provenance analysis or advanced persistent threats
  • Data scientists interested in applying anomaly detection and unsupervised machine learning to security problems
  • Database or distributed systems experts interested in supporting high-performance security analysis over complex information streams

The topics of this workshop cut across several challenge themes at the Turing, particularly Defence & Security.  Participants from a range of backgrounds are welcome to apply to attend; if space is limited, participants will be selected so as to ensure diversity of backgrounds and perspectives.

Agenda

Agenda for "Provenance, security & machine learning"

The programme will include breakout sessions, tentatively on the following four topics, which may be adjusted on the day based on audience interest:

  • Data: Data scale, velocity and variety challenges: incremental/streaming processing, querying, retention
  • Security: Making provenance useful for security; also new security or privacy issues raised by wholesale provenance collection (i.e. surveillance)
  • Machine learning: What are the appropriate/applicable machine learning techniques? (graph anomaly detection, unsupervised learning, active/semi-supervised learning)
  • Global challenges: What needs to be done for suitable datasets/challenges to be available and useful to the relevant communities?

Apply to attend

Applications to attend are now closed

Speakers

Organisers

Location

The Alan Turing Institute

1st floor of the British Library, 96 Euston Road, London, NW1 2DB
