Introduction

The problem being addressed is discovering all government services online, which are not currently centrally mapped. Any solution needs to scale to billions of web pages. The method being developed at the Turing is intended to help improve the cyber security profile of these government services.

Explaining the science

The proposed solution alternates between a discovery step, in which new candidate websites are added to a pool, and a classification step, in which candidates are classified as government or not (or into other categories of interest). The method is highly parameterised to allow trade-offs between computational cost, discoverability and precision at the classification step. Related work, explored as part of a Turing Data Study Group, looks at ways to learn domain representations (embeddings) from the contents of websites and their links in the web graph, using deep learning techniques.
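The alternation between discovery and classification can be illustrated with a minimal sketch. This is not the project's implementation: the function names, the parameters (`threshold`, `max_rounds`, `max_candidates`), the toy link graph and the suffix-based scoring rule are all hypothetical stand-ins chosen only to show the shape of the loop and where the trade-off parameters might sit.

```python
from typing import Callable, Iterable, Set

def discover_and_classify(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],   # hypothetical: outgoing links of a site
    classify: Callable[[str], float],            # hypothetical: score in [0, 1]
    threshold: float = 0.5,                      # precision vs. recall trade-off
    max_rounds: int = 3,                         # bounds computational cost
    max_candidates: int = 1000,                  # per-round budget for the candidate pool
) -> Set[str]:
    """Alternate a discovery step and a classification step (illustrative only)."""
    accepted = set(seeds)
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        # Discovery step: collect unseen sites linked from the current frontier.
        candidates = set()
        for site in sorted(frontier):
            for linked in get_links(site):
                if linked not in seen:
                    candidates.add(linked)
        # Apply the per-round budget deterministically.
        candidates = set(sorted(candidates)[:max_candidates])
        if not candidates:
            break
        seen |= candidates
        # Classification step: keep candidates scoring at or above the threshold.
        frontier = {c for c in candidates if classify(c) >= threshold}
        accepted |= frontier
    return accepted

# Toy data, invented for illustration only.
TOY_LINKS = {
    "a.gov.uk": ["b.gov.uk", "shop.example.com"],
    "b.gov.uk": ["c.gov.uk"],
    "shop.example.com": ["d.gov.uk"],
    "c.gov.uk": [],
}

def toy_links(site: str):
    return TOY_LINKS.get(site, [])

def toy_score(site: str) -> float:
    # Stand-in for a trained classifier: a naive suffix rule.
    return 1.0 if site.endswith(".gov.uk") else 0.0

found = discover_and_classify({"a.gov.uk"}, toy_links, toy_score)
```

In this toy graph, `d.gov.uk` is never reached, because the only site linking to it is rejected at the classification step and so is not expanded further. That is one simple instance of the trade-off between discoverability and computational cost: expanding rejected candidates as well would find more sites at the price of a larger pool to classify.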

Project aims

The project aims to develop a scalable method for discovering previously unknown government services on the web. Success would mean finding a significant number of new services that are not included in existing central maps, with reasonable accuracy. The underlying methodology is likely to generalise to areas beyond government services and would be a step towards a scalable, unified approach for identifying topics of interest on the web.

Applications

The immediate application of the project outcomes is to profile the cyber security settings of all government services online, so that advice and services can be offered to improve their security profile.

Recent updates

December 2018

July 2018

  • Internal evaluation workshop

March 2018

  • Delivery of prototype method with implementation and preliminary evaluation.

Organisers

External team members

Dr. Paul Jones

Research Engineering

Members of the Research Engineering Group at the Turing are contributing their expertise to this project.

Research data scientists from the group are collaborating with experts from the National Cyber Security Centre (NCSC) to develop novel ways to explore and classify websites at scale. The NCSC's motivation is to detect and profile existing online government services in order to improve their security. The Turing's main contribution is to make existing methods scale to billions of web pages and to improve on the current state of the art.

The Research Engineering group was also involved in organising and facilitating a Data Study Group challenge on learning website representations (embeddings) for this purpose.