Named entity recognition for sensitive data discovery in Portuguese

Mariana Dias; João Boné; João C. Ferreira; Ricardo Ribeiro; Rui Maia

Scientific journal paper Q2

Mariana Dias (Dias, M.); João Boné (Boné, J.); João C. Ferreira (Ferreira, J.); Ricardo Ribeiro (Ribeiro, R.); Rui Maia (Maia, R.)

Journal Title

Applied Sciences

Year (definitive publication)

2020

Language

English

Country

Switzerland

Web of Science®

Times Cited: 28

(Last checked: 2026-04-11 22:29)

Article Impact Index: 2.1

Scopus

Times Cited: 32

(Last checked: 2026-04-07 21:50)

Article Impact Index: 1.7

Google Scholar

Times Cited: 52

(Last checked: 2026-04-13 02:17)

Overton

This publication is not indexed in Overton

Abstract

Protecting sensitive data is becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. Efforts to create automatic systems are ongoing but, in most cases, the processes behind them are still manual or semi-automatic. In this work, we developed a component that can extract and classify sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and to meet legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques: rule-based and lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a set of specific classes. For the remaining classes of entities, two statistical models were tested, Conditional Random Fields and Random Forest, and finally a Bidirectional LSTM (Bi-LSTM) approach was tested. Of the statistical models, Conditional Random Fields obtained the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved an F1-score of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.
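
The hybrid design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expressions, the lexicon entries, the label names, and the `hybrid_ner` function are all hypothetical, chosen only to show how rule-based and lexicon-based matches for specific classes might be combined with the output of a learned tagger (such as a CRF or Bi-LSTM) for the remaining classes.

```python
import re
from typing import Callable, Iterable, List, NamedTuple, Optional


class Entity(NamedTuple):
    start: int
    end: int
    label: str
    text: str


# Hypothetical patterns for illustration only; not the rules used in the paper.
RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "POSTAL_CODE": re.compile(r"\b\d{4}-\d{3}\b"),  # Portuguese NNNN-NNN format
}

# Hypothetical lexicon; a real gazetteer would be far larger.
LEXICON = {"Lisboa": "LOCATION", "Porto": "LOCATION"}


def rule_entities(text: str) -> List[Entity]:
    """Return every regex match as an entity of the rule's class."""
    return [
        Entity(m.start(), m.end(), label, m.group())
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def lexicon_entities(text: str) -> List[Entity]:
    """Return whole-word occurrences of lexicon terms."""
    found = []
    for term, label in LEXICON.items():
        for m in re.finditer(rf"\b{re.escape(term)}\b", text):
            found.append(Entity(m.start(), m.end(), label, m.group()))
    return found


def overlaps(a: Entity, b: Entity) -> bool:
    """True when the two character spans share at least one position."""
    return a.start < b.end and b.start < a.end


def hybrid_ner(
    text: str,
    model: Optional[Callable[[str], Iterable[Entity]]] = None,
) -> List[Entity]:
    """Combine rule and lexicon matches with an optional learned tagger.

    Assumption: rule and lexicon matches take priority, and a model
    prediction is kept only if it does not overlap any of them.
    """
    found = rule_entities(text) + lexicon_entities(text)
    if model is not None:
        for ent in model(text):
            if not any(overlaps(ent, kept) for kept in found):
                found.append(ent)
    return sorted(found)  # ordered by start offset


if __name__ == "__main__":
    sample = "Contacte ana.silva@example.pt, Rua X, 1649-026 Lisboa."
    for entity in hybrid_ner(sample):
        print(entity.label, repr(entity.text))
```

Giving the deterministic matches priority is one possible design choice; a real system could instead resolve conflicts by confidence score or by class.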

Keywords

Sensitive data, General Data Protection Regulation, Natural language processing, Portuguese language, Named entity recognition

Fields of Science and Technology Classification

Computer and Information Sciences - Natural Sciences
Physical Sciences - Natural Sciences
Chemical Sciences - Natural Sciences
Other Natural Sciences - Natural Sciences
Civil Engineering - Engineering and Technology
Chemical Engineering - Engineering and Technology
Materials Engineering - Engineering and Technology

Publication Identifiers

Other ID (source: External)	cv-prod-id-1715958
Other ID (source: ORCID)	cv-prod-id-1715958
WoS (source: Ciência_Iscte)	WOS:000533356200102
WoS (source: External)	000533356200102
ISSN (source: ORCID)	2076-3417
DOI (source: ORCID)	10.3390/app10072303
Scopus (source: Ciência_Iscte)	2-s2.0-85083575201
DOI (source: author)	10.3390/app10072303
Scopus (source: External)	2-s2.0-85083575201
DOI (source: other)	10.3390/app10072303
Handle (source: other)	http://hdl.handle.net/10071/20414
ISSN (source: External)	2076-3417
Ciência_Iscte ID	ci-pub-70949
Handle (source: Ciência-IUL)	http://hdl.handle.net/10071/20414

Other Publication Details

Online Publication Year	2020
Publisher	MDPI
Indexes	Web of Science®; Scopus
ISSN	2076-3417 (print) 2076-3417 (online)
ISBN	--
Impact Factor	--
Volume	10	Number	7
Article Number	2303
Pages	--
Peer Reviewed	Yes
Dissemination Mean	Both (printed and digital)
