Introduction
‘Grey-hat’ online forums bring together individuals interested in hacking and illicit online monetising techniques. How can we use natural language processing on forum content to understand these communities and identify the ‘key actors’ likely to be involved in genuine cybercrime?
Explaining the science
Natural language processing, or NLP, is a branch of artificial intelligence that deals with analysing, understanding, and generating the languages that humans use. The aim is to let people interact with computers, in both written and spoken contexts, using natural human languages instead of computer languages.
One of the use cases of NLP is document classification, which involves categorising texts into different types based on their linguistic content. An everyday example of document classification is the email spam filter, which diverts incoming messages into your junk folder if certain linguistic (and non-linguistic) features are found.
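To make the idea concrete, the following is a minimal sketch of a document classifier in the style of a spam filter. It uses a naive Bayes model with add-one smoothing, chosen here purely for illustration; it is not the method used in this project. The training messages, the labels "spam" and "ham", and the function names `tokenize`, `train` and `classify` are all assumptions made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = {}          # label -> Counter of word frequencies
    doc_counts = Counter()    # label -> number of training documents
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Start from the log prior: how common this label is overall.
        score = math.log(doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing avoids zero probabilities.
        denom = sum(counts.values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy data, for illustration only.
TRAINING = [
    ("win a free prize now", "spam"),
    ("claim your free money today", "spam"),
    ("cheap pills, limited offer", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("please review the attached report", "ham"),
    ("lunch tomorrow with the project team", "ham"),
]

model = train(TRAINING)
print(classify("free prize offer", model))
print(classify("review the project report", model))
```

A real filter would be trained on far more data and, as noted above, would typically also draw on non-linguistic features such as sender information.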
Project aims
Cybercrime has proliferated in recent years, and online forums have been a key area of research for both criminologists and computer scientists. A ‘grey-hat’ forum is one in which not all of the content and goods posted are illegal in themselves, though their origin or use may be. These communities can provide a stepping stone towards more serious online criminal activities.
Previous research on these forums has relied on manually collected, incomplete, or outdated datasets. While insightful, this existing research has had a narrow focus (e.g. marketplaces, hacking material, or indecent images) and has analysed data only from short periods of time.
In this project, analysis is being conducted on a massive dataset obtained by crawling and scraping an assortment of online forums. This dataset presents a unique opportunity to understand these communities at scale, and allows for longitudinal social data analysis. Because of the complexity of these forums and the unique lexicon they use, automatic analysis will be a significant challenge, but it is still much more feasible than manual analysis.
Key questions
- What are the pathways by which individuals become ‘key actors’: prolific traders in goods and services whose peers give them high reputation scores?
- What variables predict which individuals will evolve into key actors? Is it possible to predict whether they will be particularly interested in malware, DDoS, system compromise, etc.?