Multimodal computation or interpretation? Automatic vs. critical understanding of text-image relations in racist memes in English

Polli, C.;Sindoni, M. G.
2024-01-01

Abstract

This paper discusses the epistemological differences between computational and sociosemiotic understandings of the label 'multimodal' by addressing the challenges of automatic detection of hate speech in racist memes, considered as a germane family of multimodal artifacts. Assuming that text-image interplays, as in the case of memes, may be extremely complex for AI-driven models to disentangle, the paper adopts a sociosemiotic multimodal critical approach to discuss the challenges of automatic detection of hateful memes on the Internet. As a case study, we select two different English-language datasets: (1) the Hateful Memes Challenge (HMC) Dataset, built by the Facebook AI Research group in 2020, and (2) the Text-Image Cluster (TIC) Dataset, which includes manually collected user-generated (UG) hateful memes. By discussing different combinations of non-hateful/hateful texts and non-hateful/hateful images, we show how humour, intertextuality, and anomalous juxtapositions of texts and images, as well as contextual cultural knowledge, may render AI-based automatic interpretation incorrect, biased or misleading. In our conclusions, we argue the case for the development of computational models that incorporate insights from sociosemiotics and multimodal critical discourse analysis.

Use this identifier to cite or link to this document: https://hdl.handle.net/11570/3287808
