Semi-supervised annotation of Portuguese hate speech across social media domains

Raquel Bento Santos; Bernardo Cunha Matos; Paula Carvalho; Fernando Batista; Ricardo Ribeiro

Publication in conference proceedings

Raquel Bento Santos (Santos, R. B.); Bernardo Cunha Matos (Matos, B. C.); Paula Carvalho (Carvalho, P.); Fernando Batista (Batista, F.); Ricardo Ribeiro (Ribeiro, R.)

OpenAccess Series in Informatics

Year (definitive publication)

2022

Language

English

Country

Germany

Web of Science®

This publication is not indexed in Web of Science®

Scopus

Times Cited: 7

(Last checked: 2025-04-02 02:20)

Google Scholar

Times Cited: 13

(Last checked: 2025-04-01 13:01)

Overton

This publication is not indexed in Overton

Abstract

With the increasing spread of hate speech (HS) on social media, it has become urgent to develop models that can help detect it automatically. Such models typically require large-scale annotated corpora, which are still scarce for languages such as Portuguese, and creating manually annotated corpora is very expensive and time-consuming. To address this problem, we propose an ensemble of two semi-supervised models that can be used to automatically create a corpus representative of online hate speech in Portuguese. The first model combines Generative Adversarial Networks with a BERT-based model. The second model is based on label propagation: it propagates labels from existing annotated corpora to unlabeled data by exploiting the notion of similarity. We used the annotations of three existing corpora (CO-HATE, ToLD-BR, and HPHS) to automatically annotate FIGHT, a corpus composed of geolocated tweets produced in Portuguese territory. In the process of selecting the best model and the corresponding setup, we tested different pre-trained embeddings, performed experiments using different training subsets labeled by different annotators with different perspectives, and performed several experiments with active learning. Furthermore, this work explores back translation as a means to automatically generate additional hate speech samples. The best results were achieved by combining all the labeled datasets, obtaining an F1-score of 0.664 for the Hate Speech class in FIGHT.
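The label-propagation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: this record does not specify the similarity measure, the graph construction, or the embeddings used, so the cosine similarity, the fully connected graph, the toy 2-dimensional vectors, and the names `cosine` and `propagate_labels` are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate_labels(vectors, labels, n_classes=2, n_iter=50):
    """Generic iterative label propagation (hypothetical helper).

    vectors: list of embedding vectors (toy stand-ins for text embeddings)
    labels:  list of class ids, with None marking unlabeled items
    Returns one predicted class id per item.
    """
    n = len(vectors)
    # Edge weights: cosine similarity clipped at zero, with no self-loops.
    weights = [
        [max(cosine(vectors[i], vectors[j]), 0.0) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    # Labeled items start as one-hot distributions, unlabeled ones as uniform.
    scores = [
        [1.0 if c == lab else 0.0 for c in range(n_classes)]
        if lab is not None else [1.0 / n_classes] * n_classes
        for lab in labels
    ]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total = sum(weights[i])
            if labels[i] is not None or total == 0.0:
                # Clamp known labels; leave isolated items unchanged.
                updated.append(scores[i])
                continue
            # Each unlabeled item takes the similarity-weighted average
            # of its neighbours' current label distributions.
            updated.append([
                sum(weights[i][j] * scores[j][c] for j in range(n)) / total
                for c in range(n_classes)
            ])
        scores = updated
    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]


# Toy data: two labeled items (class 0 and class 1) and two unlabeled ones.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
toy_labels = [0, 1, None, None]
predicted = propagate_labels(toy_vectors, toy_labels)
print(predicted)
```

In this toy setting each unlabeled vector ends up with the class of the labeled vector it most resembles. In a realistic setup the vectors would come from a pre-trained text embedding model, and the graph would usually be sparsified (for example, by keeping only the nearest neighbours).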

Keywords

Hate speech, Semi-supervised learning, Semi-automatic annotation

Fields of Science and Technology Classification

Computer and Information Sciences - Natural Sciences
Languages and Literature - Humanities

Funding Records

Funding Reference	Funding Entity
HATE Covid-19 (Proj. 759274510)	Fundação para a Ciência e a Tecnologia
UIDB/50021/2020	Fundação para a Ciência e a Tecnologia
PTDC/CCI-CIF/32607/2017	Fundação para a Ciência e a Tecnologia

Publication Identifiers

Scopus (source: Ciência_Iscte)	2-s2.0-85136097791
DOI (source: author)	10.4230/OASIcs.SLATE.2022.11
Ciência_Iscte ID	ci-pub-89928
Handle (source: Ciência-IUL)	http://hdl.handle.net/10071/25973

Other Publication Details

Online Publication Year	2022
Publisher	Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
Indexes	Scopus
ISSN	2190-6807 (online)
ISBN	978-3-95977-245-7 (online)
Volume	104
Article Number	11
Total Pages	14
Peer Reviewed	Yes
Editors	Cordeiro, J., Pereira, M. J., Rodrigues, N. F., and Pais, S.
Event Title	11th Symposium on Languages, Applications and Technologies (SLATE 2022)
Event Organizer	Universidade da Beira Interior
City	Covilhã
Event Type	Conference
Event Classification	International
Event Year	2022
Event Publication Type	Full Paper