Introduction
Government services on the web are not currently mapped in any central register. This project addresses the problem of discovering all of them, with a solution that needs to scale to billions of web pages. The method being developed at the Turing is intended to help improve the cyber security profile of these government services.
Explaining the science
The proposed solution iterates between a discovery step, where new candidate websites are added to a pool, and a classification step, where candidates are classified as government or non-government (or into other categories of interest). The method is highly parameterised to allow trade-offs between computational cost, discoverability and precision at the classification step. Related work, explored as part of a Turing Data Study Group, looks at ways to learn domain representations (embeddings) from the contents of websites and their links in the web graph, using deep learning techniques.
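The alternation between discovery and classification can be illustrated with a minimal sketch. This is not the project's implementation: the function name, the toy link graph, the `score` callable and the parameters `threshold`, `max_rounds` and `budget_per_round` are all hypothetical, chosen only to show how such parameters could trade computational cost against discoverability and precision. The sketch also assumes that discovery follows outgoing links from sites already accepted, which the text above does not specify.

```python
from typing import Callable, Dict, List, Set


def discover_and_classify(
    seeds: Set[str],
    outlinks: Dict[str, List[str]],
    score: Callable[[str], float],
    threshold: float = 0.5,
    max_rounds: int = 3,
    budget_per_round: int = 100,
) -> Set[str]:
    """Alternate discovery and classification, starting from known sites.

    All names and parameters here are illustrative assumptions.
    """
    accepted = set(seeds)   # domains currently labelled as government
    seen = set(seeds)       # domains that have already been classified
    frontier = set(seeds)   # domains whose links are expanded next

    for _ in range(max_rounds):
        # Discovery step: pool unseen domains linked from the frontier.
        candidates: List[str] = []
        for domain in sorted(frontier):
            for target in outlinks.get(domain, []):
                if target not in seen and target not in candidates:
                    candidates.append(target)

        # Cap the pool size to bound the cost of each round.
        candidates = candidates[:budget_per_round]
        if not candidates:
            break
        seen.update(candidates)

        # Classification step: keep candidates scoring at or above threshold.
        frontier = {d for d in candidates if score(d) >= threshold}
        accepted |= frontier

    return accepted


# Toy web graph (hypothetical domains) and a stand-in classifier.
toy_links = {
    "gov.example": ["tax.example", "shop.example"],
    "tax.example": ["benefits.example"],
    "shop.example": ["ads.example"],
}
toy_government = {"tax.example", "benefits.example"}


def toy_score(domain: str) -> float:
    # A real classifier would use page contents or learned embeddings.
    return 1.0 if domain in toy_government else 0.0


found = discover_and_classify({"gov.example"}, toy_links, toy_score)
one_round = discover_and_classify(
    {"gov.example"}, toy_links, toy_score, max_rounds=1
)
```

In this sketch, raising `threshold` favours precision, while raising `max_rounds` or `budget_per_round` favours discoverability at a higher computational cost. Because only accepted domains are expanded, a rejected site such as `shop.example` never leads to further candidates.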
Project aims
The project's aim is to develop a scalable method to discover previously unknown government services on the web. Success would mean finding a significant number of new services, not currently included in existing central maps, with reasonable accuracy. The underlying methodology is likely to generalise to areas beyond government services and would constitute a step towards a scalable, unified approach for identifying topics of interest on the web.
Applications
The immediate application of the project outcomes is to profile the cyber security settings of all government services online, in order to provide advice and services with a view to improving their security profile.
Recent updates
December 2018
July 2018
- Internal evaluation workshop
March 2018
- Delivery of prototype method with implementation and preliminary evaluation