Web domain discovery

Developing a scalable way to discover previously unknown government services on the web

Introduction

The project addresses the problem of discovering all government services online, which are not currently mapped centrally. The solution needs to scale to billions of web pages. Once services have been discovered they can be assessed, so the method being developed at the Turing would help improve the cyber security profile of these government services.

Explaining the science

The proposed solution iterates between a discovery step, where new candidate websites are added to a pool, and a classification step, where candidates are classified as government or not (or into other categories of interest). The method is highly parameterised, to allow for trade-offs between computational cost, discoverability, and precision at the classification step. Related work, explored as part of a Turing Data Study Group, looks at ways to learn domain representations (embeddings) from the contents of websites and their links in the web graph, using deep learning techniques.
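The alternation between discovery and classification described above can be illustrated with a minimal sketch. It is not the project's implementation: the toy link graph, the keyword-based scoring function, and the parameter names (threshold, max_rounds, budget_per_round, expand_rejected) are all assumptions chosen for illustration. They stand in for a real crawl, a trained classifier, and the trade-off parameters the text mentions.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy web graph: domain -> domains it links to.
# A real system would obtain these links from a crawl (assumption).
TOY_LINKS: Dict[str, List[str]] = {
    "seed.gov.example": ["tax.gov.example", "news.example"],
    "tax.gov.example": ["benefits.gov.example", "shop.example"],
    "news.example": ["shop.example"],
    "benefits.gov.example": [],
    "shop.example": ["casino.example"],
    "casino.example": [],
}


def toy_score(domain: str) -> float:
    """Placeholder classifier standing in for a trained model (assumption)."""
    return 0.9 if ".gov." in domain else 0.1


def discover_and_classify(
    seeds: Set[str],
    links: Dict[str, List[str]],
    score: Callable[[str], float],
    threshold: float = 0.5,       # higher values favour precision
    max_rounds: int = 3,          # caps the number of iterations
    budget_per_round: int = 10,   # caps candidates classified per round
    expand_rejected: bool = False,  # True favours discoverability over cost
) -> Set[str]:
    """Alternate discovery and classification, starting from known domains."""
    accepted = set(seeds)
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        # Discovery step: collect unseen domains linked from the frontier.
        linked = {t for d in frontier for t in links.get(d, [])}
        candidates = sorted(linked - seen)[:budget_per_round]
        if not candidates:
            break
        seen.update(candidates)
        # Classification step: keep candidates scoring at or above threshold.
        newly_accepted = {c for c in candidates if score(c) >= threshold}
        accepted |= newly_accepted
        # Choose which domains to expand in the next discovery step.
        frontier = set(candidates) if expand_rejected else newly_accepted
    return accepted


found = discover_and_classify({"seed.gov.example"}, TOY_LINKS, toy_score)
print(sorted(found))
```

In this sketch, expanding only accepted candidates keeps the pool small, while setting expand_rejected to True follows links from rejected domains as well, at a higher computational cost.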

Project aims

The project's aim is to develop a scalable method to discover previously unknown government services on the web. Success would mean finding a significant number of new services, not currently included in existing central maps, with reasonable accuracy. The underlying methodology is likely to generalise to areas beyond government services and would constitute a step towards a scalable, unified approach for identifying topics of interest on the web.

Applications

The immediate application of the project outcomes is to profile the cyber security settings of all government services online, in order to provide advice and services with a view to improving their security profile.

Recent updates

December 2018

July 2018

  • Internal evaluation workshop

March 2018

  • Delivery of prototype method with implementation and preliminary evaluation

Contact info

[email protected]

 

External team members

Dr. Paul Jones