Publication in conference proceedings Q3
Early experiments on automatic annotation of Portuguese medieval texts
Maria Inês Bico (Bico, M. I.); Jorge Baptista (Baptista, J.); Fernando Batista (Batista, F.); Esperança Cardeira (Cardeira, E.)
Linking theory and practice of digital libraries. Lecture Notes in Computer Science
Year (definitive publication)
2022
Language
English
Country
Italy
More Information
Web of Science®

Times Cited: 0

(Last checked: 2024-05-16 20:57)

Scopus

Times Cited: 1

(Last checked: 2024-05-13 08:42)

: 0.4
Google Scholar

Times Cited: 3

(Last checked: 2024-05-13 11:53)

Abstract
This paper presents the challenges encountered and the solutions adopted in the lemmatization and part-of-speech (PoS) tagging of a corpus of Old Portuguese texts (up to 1525), paving the way for the automatic annotation of these medieval texts. A highly granular tagset, previously devised for Modern Portuguese, was adapted to this end. A large text (∼155 thousand words) was manually annotated for PoS and lemmata and used to train an initial PoS-tagger model. When applied to two other texts, the resulting model attained 91.2% precision on a textual variant of the training text and 67.4% on a new, unseen text. A second model was then trained on the data from these three texts and applied to two further unseen texts, achieving a precision of 77.3% and 82.4%, respectively.
Acknowledgements
Research for this paper was partially funded by public funds through FCT, proj. ref. UIDB/50021/2020, proj. ref. UIDP/00214/2020, and proj. ref. UI/BD/152806/2022.
Keywords
Automatic annotation, Lemmatization, Part-of-speech tagging, Old Portuguese
  • Mathematics - Natural Sciences
  • Computer and Information Sciences - Natural Sciences
Funding Records
Funding Reference    Funding Entity
UIDB/50021/2020      Fundação para a Ciência e a Tecnologia
UI/BD/152806/2022    Fundação para a Ciência e a Tecnologia
UIDP/00214/2020      Fundação para a Ciência e a Tecnologia