Tackling the challenge of generalisable hate speech detection
Challenge
Online harms such as misinformation, malicious behaviour, and hate speech are pressing challenges for researchers, policymakers, and industry alike. Hate speech in particular is complex because it manifests in many forms: across languages, cultures, and online platforms, and within different demographics and communities. Expressions of hate can range from overt insults to coded language, emojis, or platform-specific slang.
This diversity presents a major problem for current hate speech detection models, which often fail when faced with data that differs from their training examples. For instance, a model trained on text without emojis may fail to recognise emojis used with hateful intent. These limitations risk both false positives (flagging harmless content) and false negatives (missing harmful content), reducing trust in automated moderation systems.
Led by Dr Arkaitz Zubiaga, Senior Lecturer in Computer Science at Queen Mary University of London, the research team set out to understand why hate speech detection models often fail to generalise and how to make them more robust across the wide variety of contexts in which hate speech occurs.
Approach
Dr Zubiaga’s team conducted one of the first comprehensive reviews of hate speech detection methods, drawing from natural language processing (NLP) and computational social science (CSS). The project examined how models performed across multiple dimensions: languages, cultures, platforms, demographics, and types of hate speech such as sexism, racism, or homophobia.
Central to the analysis was cross-dataset testing. If a model is trained on datasets A and B, can it still identify hate speech in dataset C, one it has never seen before? By systematically assessing transferability, the researchers pinpointed where models succeeded and where they broke down.
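As a rough illustration of this setup (not the models or datasets examined in the review), the sketch below trains a simple TF-IDF and logistic regression baseline on two pooled datasets and evaluates it on a third the model has never seen. The dataset contents are toy placeholders.

```python
# Minimal sketch of cross-dataset evaluation for hate speech detection.
# The corpora below are toy placeholders (1 = hateful, 0 = not hateful),
# and the TF-IDF + logistic regression baseline is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

dataset_a = (["example hateful post", "a friendly message"], [1, 0])
dataset_b = (["another abusive comment", "nice chess game today"], [1, 0])
dataset_c = (["coded slur from a different platform", "harmless banter"], [1, 0])  # never seen in training

# Train on the pooled "seen" datasets A and B.
train_texts = dataset_a[0] + dataset_b[0]
train_labels = dataset_a[1] + dataset_b[1]
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# Evaluate on the unseen dataset C: a large drop relative to in-domain
# performance signals poor generalisability.
preds = model.predict(dataset_c[0])
print("Cross-dataset macro-F1:", f1_score(dataset_c[1], preds, average="macro"))
```

Repeating this procedure with each dataset held out in turn gives a picture of how well a model transfers across languages, platforms, and communities.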
The study also highlighted progress in the field, including global efforts to collect and label new datasets and the use of unlabelled data to improve model performance. However, dataset creation remains resource-intensive, with most datasets still too small to represent the full spectrum of hate speech online.
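One way unlabelled data can help, sketched below with assumed toy data and a simple baseline, is self-training: a model assigns pseudo-labels to unlabelled posts it classifies with high confidence and is then retrained on the expanded set. This is a generic illustration of the idea rather than a method attributed to the review.

```python
# Minimal self-training sketch: use confident predictions on unlabelled
# posts as extra training data. Texts, labels, and the 0.8 confidence
# threshold are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labelled_texts = ["example hateful post", "a friendly message",
                  "another abusive comment", "nice chess game today"]
labelled_y = [1, 0, 1, 0]
unlabelled_texts = ["yet another abusive comment", "see you at the chess club"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(labelled_texts, labelled_y)

# Keep only the unlabelled posts the model classifies with high confidence.
probs = model.predict_proba(unlabelled_texts)
confident = np.max(probs, axis=1) >= 0.8
pseudo_texts = [t for t, keep in zip(unlabelled_texts, confident) if keep]
pseudo_y = [int(p) for p, keep in zip(np.argmax(probs, axis=1), confident) if keep]

# Retrain on the original labels plus the pseudo-labelled posts.
model.fit(labelled_texts + pseudo_texts, labelled_y + pseudo_y)
```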
The project was interdisciplinary, combining data science with insights from psychology to capture the human and social dimensions of online abuse. Students and early-career researchers were also closely involved, supporting data analysis and contributing to the synthesis of findings.
Impact
The review revealed that many existing hate speech detection models lack robustness because they are trained narrowly on specific datasets that fail to capture the diversity of online expression. This means they often struggle to generalise, leading to real-world failures.
The study has had a major impact on the research community. Cited over 250 times, it has guided new approaches to dataset design, cross-platform evaluation, and model development, with researchers worldwide tackling the problem of generalisability from different angles.
The work also underscores important ethical considerations. Misclassifications can have harmful consequences: censoring friendly conversation or, conversely, allowing abusive material to spread. Human moderators are often required as a safeguard, but repeated exposure to harmful content can negatively affect their wellbeing — highlighting the urgency of improving automated systems.
One striking example comes from an online chess community that was wrongly flagged as hateful because of repeated references to “black” and “white.” This error, later corrected by human moderators, illustrates the importance of context-sensitive AI systems that can distinguish between innocuous and harmful use of language.
Ultimately, the research led by Dr Zubiaga emphasises that trustworthy AI in hate speech detection depends on better dataset design, interdisciplinary collaboration, and ethical oversight. Improving model generalisability not only reduces harms online but also supports healthier digital communities worldwide.
Dr Arkaitz Zubiaga is part of the Centre for Human-Centred Computing.