Recognizing and linking named entities in Portuguese medieval texts

Bico, Maria Inês; Jorge Baptista; Fernando Batista; Cardeira, Esperança

Book chapter

Bico, Maria Inês (Bico, M. I.); Jorge Baptista (Baptista, J.); Fernando Batista (Batista, F.); Cardeira, Esperança (Cardeira, E.)

Book title

Digital humanities looking at the world: Exploring innovative approaches and contributions to society

Year (final publication)

2024

Language

English

Country

Switzerland

Web of Science®

This publication is not indexed in Web of Science®

Scopus

Number of citations: 0

(Last checked: 2026-06-25 22:38)

Google Scholar

Number of citations: 0

(Last checked: 2026-06-23 21:24)

Overton

This publication is not indexed in Overton

Abstract

Despite the continuous development of approaches and tools for Named Entity Recognition (NER), historical texts still pose problems that modern texts do not. These problems stem from the nature and type of the documents, the period in which they were produced, diachronic differences in the language, and the way the texts were extracted from their sources and preserved in digital form. This paper addresses the challenges of identifying and categorizing named entities (NE) in Old and Middle Portuguese, and briefly addresses the challenge of disambiguating them. The Portuguese Corpus of Ancient Texts consists of texts dating from the 13th century to 1525. All texts are transcribed and preserved in XML format with little editorial intervention, using the web-based platform TEITOK. An automatic part-of-speech (PoS) annotation model was created and applied to six texts of the corpus, to enrich them and improve search queries. The automatic annotation was followed by manual correction, so that more than half a million tokens have been lemmatized and annotated with their part of speech and inflection. For the NER task, the paper presents the method's pipeline, following the annotation of NE in a corpus of more than 400,000 tokens. It also establishes criteria for determining NE boundaries, in particular for identifying name-chains and composite entities that span more than one word; for this task, the immediate context of the words was considered. Concerning disambiguation, the paper shows how pervasive ambiguity is, both within individual texts and across the corpus.
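The abstract mentions criteria for name-chain boundaries that rely on the immediate context of each word. As a rough illustration only, and not the authors' method, the following Python sketch groups adjacent proper nouns into a single candidate entity and absorbs a linking particle only when another proper noun follows it. The tag name `PROPN`, the particle list `CONNECTORS`, the function `find_name_chains`, and the sample tokens are all assumptions made for this example; the corpus's actual tag set and spelling variants are not given in this record.

```python
from typing import List, Tuple

# Hypothetical PoS-tagged token: (form, pos). The tag "PROPN" is an
# assumption; the corpus's real tag set is not specified in this record.
Token = Tuple[str, str]

# Assumed list of linking particles; medieval spelling variants would
# need to be added for real data.
CONNECTORS = {"de", "da", "do", "das", "dos"}


def find_name_chains(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, text) spans for candidate name-chains."""
    spans = []
    i, n = 0, len(tokens)
    while i < n:
        if tokens[i][1] != "PROPN":
            i += 1
            continue
        start, end = i, i + 1
        j = end
        while j < n:
            form, pos = tokens[j]
            if pos == "PROPN":
                # Adjacent proper nouns extend the chain.
                end = j + 1
                j += 1
            elif (form.lower() in CONNECTORS
                  and j + 1 < n
                  and tokens[j + 1][1] == "PROPN"):
                # A particle is absorbed only if a proper noun follows it.
                end = j + 2
                j += 2
            else:
                break
        text = " ".join(form for form, _ in tokens[start:end])
        spans.append((start, end, text))
        i = end
    return spans


if __name__ == "__main__":
    sample = [("dom", "NOUN"), ("Afonso", "PROPN"), ("de", "ADP"),
              ("Portugal", "PROPN"), ("e", "CCONJ"), ("o", "DET"),
              ("bispo", "NOUN"), ("de", "ADP"), ("Lisboa", "PROPN")]
    for span in find_name_chains(sample):
        print(span)
```

In this sketch the particle before "Lisboa" is left outside the entity because the word preceding it is a common noun, which is one simple way of letting immediate context decide where a boundary falls.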

Funding records

Funding reference	Funding entity
UIDB/50021/2020	Fundação para a Ciência e a Tecnologia
UI/BD/152806/2022	Fundação para a Ciência e a Tecnologia
UIDP/00214/2020	Fundação para a Ciência e a Tecnologia

Publication identifiers

Scopus (source: author)	2-s2.0-85205613517
DOI (source: author)	10.1007/978-3-031-48941-9_25
Scopus (source: Ciência_Iscte)	2-s2.0-85205613517
Handle (source: Ciência-IUL)	http://hdl.handle.net/10071/32997
Ciência_Iscte ID	ci-pub-104145

Other publication details

Online publication year	2024
Publisher	Springer Nature
Indexing	Scopus
ISSN	--
ISBN	978-3-031-48940-2 (print) 978-3-031-48941-9 (online)
Volume
Series		Issue/Tome
Pages	329-337	Total pages	9
Edition	--
Peer reviewed	Yes
Editors	Sílvia Araújo; Micaela Aguiar; Liana Ermakova
Publication date (online)	2024-04-20
Publication date (print)
