This project addresses the problem of discovering all government services online, which are not currently mapped centrally. The solution needs to scale to billions of web pages. The method being developed at the Turing is intended to help improve the cyber security profile of these government services.
Explaining the science
The proposed solution iterates between a discovery step, where new candidate websites are added to a pool, and a classification step, where candidates are classified as government or not (or into other categories of interest). The method is highly parameterised, to allow trade-offs between computational cost, discoverability, and precision at the classification step. Related work, explored as part of a Turing Data Study Group, looks at ways to learn domain representations (embeddings) from the contents of websites and their links in the web graph, using deep learning techniques.
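The alternation between discovery and classification can be illustrated with a minimal sketch. This is not the project's implementation: it assumes that discovery works by following outgoing links from already accepted domains, and the names `discover_services`, `outlinks`, `score`, `threshold`, `max_rounds` and `max_candidates`, as well as the toy graph and toy classifier, are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set


def discover_services(
    seeds: Iterable[str],
    outlinks: Callable[[str], Iterable[str]],
    score: Callable[[str], float],
    threshold: float = 0.5,      # hypothetical knob: precision of acceptance
    max_rounds: int = 3,         # hypothetical knob: how far discovery reaches
    max_candidates: int = 1000,  # hypothetical knob: compute budget per round
) -> Set[str]:
    """Alternate discovery and classification, starting from known domains."""
    accepted: Set[str] = set(seeds)
    seen: Set[str] = set(accepted)
    frontier: List[str] = sorted(accepted)
    for _ in range(max_rounds):
        # Discovery step (assumed here to be link-following): collect
        # unseen domains linked from the domains accepted last round.
        candidates: List[str] = []
        for domain in frontier:
            for target in outlinks(domain):
                if target not in seen and len(candidates) < max_candidates:
                    seen.add(target)
                    candidates.append(target)
        if not candidates:
            break
        # Classification step: keep candidates scoring at or above threshold.
        frontier = [c for c in candidates if score(c) >= threshold]
        accepted.update(frontier)
    return accepted


# Toy web graph and toy classifier, both invented for this example.
TOY_GRAPH: Dict[str, List[str]] = {
    "a.gov.example": ["b.gov.example", "shop.example"],
    "b.gov.example": ["c.gov.example"],
    "shop.example": ["d.gov.example"],
}


def toy_outlinks(domain: str) -> List[str]:
    return TOY_GRAPH.get(domain, [])


def toy_score(domain: str) -> float:
    return 1.0 if ".gov." in domain else 0.0


found = discover_services(["a.gov.example"], toy_outlinks, toy_score)
```

In this toy run, `d.gov.example` is never reached because it is only linked from a rejected domain. Expanding rejected candidates as well would improve discoverability at a higher computational cost, which is the kind of trade-off the parameters are meant to expose.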
The project's aim is to develop a scalable method to discover previously unknown government services on the web. Success would mean finding a significant number of new services, not currently included in existing central maps, with reasonable accuracy. The underlying methodology is likely to generalise to areas beyond government services and would be a step towards a scalable, unified approach for identifying topics of interest on the web.
The immediate application of the project outcomes is to profile the cyber security settings of all government services online, in order to provide advice and services with a view to improving their security profile.
- Internal evaluation workshop
- Delivery of prototype method with implementation and preliminary evaluation
External team members
Dr. Paul Jones