Abusive online content, such as hate speech, is a widespread problem. It inflicts serious harm on people who are targeted, and threatens open and pluralistic discourse in online spaces. To keep their users safe (and avoid legal and financial penalties), platforms increasingly rely on artificial intelligence (AI) to detect hateful content. Facebook, for instance, used AI to remove almost 97% of posts that it classified as hate speech in Q1 2021. However, despite impressive headline results, it is unclear how good AI actually is at detecting hate.

AI is widely lauded as a way of reducing the burden on human moderators, who often suffer serious emotional and psychological harm as a result of their work. However, to understand whether AI could, and should, replace human moderators, we need to understand its strengths and limitations.

In our new research, we introduce HateCheck, a suite of functional tests for hate speech detection models. HateCheck provides diagnostic insights into specific model functionalities, i.e. their ability to correctly classify different kinds of hateful and non-hateful content. It thereby enables targeted and granular evaluation of hate speech detection models. The critical weaknesses it reveals in current models pose new and important questions about the widespread use of AI for content moderation.

Why do we need HateCheck?

Until now, AI models for hate speech detection have primarily been evaluated by measuring their performance on ‘benchmark’ hate speech datasets. These datasets can be highly biased: they typically cover only small subsets of hateful content (e.g., racist and sexist hate), are drawn from specific platforms (mostly Twitter), and were created using biased sampling strategies (e.g., keyword searches). Therefore, performance on these datasets is at best an incomplete measure of model quality, and at worst misleading.

How did we build HateCheck?

There are 29 functional tests in HateCheck – 18 tests for different kinds of hateful content and 11 tests for challenging non-hateful content. To motivate their selection, we reviewed previous hate speech research and interviewed civil society stakeholders from NGOs whose work directly relates to online hate. For example, several interviewees were concerned about the misclassification of counter speech, i.e. direct responses to hate that denounce it. Consequently, we included functional tests for counter speech that quotes or references hateful language.

Each test case within a given functional test is a simple statement that is either clearly hateful or non-hateful. We wanted to expose models that rely on overly simplistic decision rules, which is why we created non-hateful cases (e.g. “I love immigrants”) as direct contrasts to hateful cases (e.g. “I hate immigrants”). In total, HateCheck covers 3,728 test cases.
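For readers who prefer a concrete picture, the test suite is essentially a table of short labelled statements. The sketch below shows one way such cases could be represented in Python; the field names are illustrative and do not necessarily match the format of the released dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestCase:
    """One HateCheck-style test case: a short statement with a gold label."""
    functionality: str           # which functional test the case belongs to (name is illustrative)
    text: str                    # the statement shown to the model
    gold_label: str              # "hateful" or "non-hateful"
    target_group: Optional[str]  # protected group referenced, if any
    contrast_of: Optional[int] = None  # index of the hateful case this case directly contrasts with

# A hateful case and its non-hateful direct contrast (examples from above)
cases = [
    TestCase("derogation", "I hate immigrants.", "hateful", "immigrants"),
    TestCase("positive_statement", "I love immigrants.", "non-hateful", "immigrants", contrast_of=0),
]
```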

What did we learn from testing models with HateCheck?

HateCheck can be used to test any English-language hate speech detection model. In our article, we tested two current academic models as well as two popular commercial models to illustrate HateCheck’s function as a diagnostic tool. For the academic models, we trained near-state-of-the-art transformer-based neural networks on two widely used hate speech datasets. For the commercial models, we chose Google Jigsaw’s Perspective and Two Hat’s SiftNinja.
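Because every test case comes with a gold label, running a model on HateCheck amounts to classifying each statement and comparing the prediction to that label. A minimal sketch using the Hugging Face transformers library is shown below; the model name is a placeholder, not one of the models we evaluated.

```python
# Minimal sketch: classify HateCheck-style test cases with a transformer classifier.
# "some-org/english-hate-speech-model" is a placeholder, not a real model name.
from transformers import pipeline

classifier = pipeline("text-classification", model="some-org/english-hate-speech-model")

test_cases = [
    {"text": "I hate immigrants.", "gold": "hateful"},
    {"text": "I love immigrants.", "gold": "non-hateful"},
]

for case in test_cases:
    prediction = classifier(case["text"])[0]  # e.g. {"label": "...", "score": 0.97}
    print(f'{case["text"]!r} -> {prediction["label"]} (gold: {case["gold"]})')
```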

HateCheck reveals clear functional weaknesses in all the AI models that we tested.

Both academic and commercial models appear to be overly sensitive to specific keywords. For example, they correctly detect nearly all hateful uses of slurs. However, they misclassify most non-hateful reclaimed uses of slurs (e.g. “I’m a f*g. Deal with it”). This suggests that the models encode overly simplistic keyword-based decision rules (e.g. that slurs are always hateful) rather than capturing important nuances. As a result, the models penalise the very communities that are most commonly targeted by hate speech in the first place.

HateCheck also demonstrates that models are biased in how they handle hate against different targeted groups. They are far worse at detecting hate aimed at some protected groups (e.g. women) than at others (e.g. Muslims). This means that the models reinforce biases in which groups are protected, and which are not, in online spaces.
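Both of these findings come from breaking accuracy down by functional test and by targeted group, rather than relying on a single overall score. A simple sketch of that kind of breakdown, assuming test cases and predictions are held in plain Python dictionaries and lists:

```python
from collections import defaultdict

def accuracy_by(key, cases, predictions):
    """Accuracy of predictions against gold labels, broken down by `key`
    (e.g. "functionality" or "target_group")."""
    correct, total = defaultdict(int), defaultdict(int)
    for case, pred in zip(cases, predictions):
        group = case[key]
        total[group] += 1
        correct[group] += int(pred == case["gold"])
    return {group: correct[group] / total[group] for group in total}

# Hypothetical usage: a markedly lower score for one target group than for
# others would indicate exactly the kind of bias described above.
# accuracy_by("target_group", cases, predictions)
```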

Where next for hate speech detection models?

For practical applications such as content moderation, the weaknesses we identified are critical, and better models urgently need to be developed. This will require neural network architectures that better capture the complexity of natural language. Perhaps more importantly, it will require larger training datasets with fewer systematic gaps and biases. HateCheck can support researchers in creating such datasets by guiding targeted data augmentation.
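As a simple illustration of what guiding targeted data augmentation could mean in practice, the per-functionality accuracies from the previous sketch can be used to flag the functional tests a model fails, which in turn suggest what kinds of training examples to add. The threshold below is an arbitrary illustration.

```python
def functionalities_to_augment(per_functionality_accuracy, threshold=0.8):
    """Flag functional tests the model fails as candidates for targeted
    data augmentation. The 0.8 threshold is an arbitrary illustration."""
    return sorted(
        name for name, accuracy in per_functionality_accuracy.items()
        if accuracy < threshold
    )

# e.g. {"reclaimed_slurs": 0.12, "quoted_counter_speech": 0.45, "explicit_derogation": 0.95}
# -> ["quoted_counter_speech", "reclaimed_slurs"]
```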

Key takeaways

AI models for hate speech detection play a crucial role in creating safe and open online spaces, and they are already widely deployed. Until now, they have been evaluated on biased and incomplete hate speech benchmark datasets, which risks mischaracterising their abilities and overestimating their quality. HateCheck addresses this issue by directly diagnosing what the AI can and cannot do. Our work identifies critical weaknesses in current models from academia and industry, and we hope that HateCheck aids the development of better AI in the future.

Please see the full HateCheck article, forthcoming at ACL 2021. HateCheck is freely available for commercial and academic use via GitHub. For further information, email Paul Röttger, [email protected]