Export Publication

The publication can be exported in the following formats: APA (American Psychological Association) reference, IEEE (Institute of Electrical and Electronics Engineers) reference, BibTeX, and RIS.

Export Reference (APA)
Dias, M., Boné, J., Ferreira, J., Ribeiro, R., & Maia, R. (2020). Named entity recognition for sensitive data discovery in Portuguese. Applied Sciences, 10(7).
Export Reference (IEEE)
M. Dias et al., "Named entity recognition for sensitive data discovery in Portuguese," Applied Sciences, vol. 10, no. 7, 2020.
Export BibTeX
@article{dias2020_1732725698170,
	author = "Dias, M. and Boné, J. and Ferreira, J. and Ribeiro, R. and Maia, R.",
	title = "Named entity recognition for sensitive data discovery in Portuguese",
	journal = "Applied Sciences",
	year = "2020",
	volume = "10",
	number = "7",
	doi = "10.3390/app10072303",
	url = "https://www.mdpi.com/2076-3417/10/7/2303"
}
Export RIS
TY  - JOUR
TI  - Named entity recognition for sensitive data discovery in Portuguese
T2  - Applied Sciences
VL  - 10
IS  - 7
AU  - Dias, M.
AU  - Boné, J.
AU  - Ferreira, J.
AU  - Ribeiro, R.
AU  - Maia, R.
PY  - 2020
SN  - 2076-3417
DO  - 10.3390/app10072303
UR  - https://www.mdpi.com/2076-3417/10/7/2303
AB  - The process of protecting sensitive data is continually growing and becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. The effort to create automatic systems is continuous, but, in most cases, the processes behind them are still manual or semi-automatic. In this work, we have developed a component that can extract and classify sensitive data from unstructured text information in European Portuguese. The objective was to create a system that allows organizations to understand their data and comply with legal and security purposes. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques such as rule-based/lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a set of specific classes. For the remaining classes of entities, two statistical models were tested (Conditional Random Fields and Random Forest) and, finally, a Bidirectional-LSTM approach was tested. Regarding the statistical models, we found that Conditional Random Fields obtains the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were HAREM Golden Collection, SIGARRA News Corpus, and DataSense NER Corpus.
ER  -