Ciência-IUL
Linking theory and practice of digital libraries. Lecture Notes in Computer Science
Year (definitive publication)
2022
Language
English
Country
Italy
Abstract
This paper presents the challenges encountered, and the solutions adopted, in the lemmatization and part-of-speech (PoS) tagging of a corpus of Old Portuguese texts (up to 1525), paving the way for the automatic annotation of these medieval texts. A highly granular tagset, previously devised for Modern Portuguese, was adapted to this end. A large text (∼155 thousand words) was manually annotated for PoS and lemmata and used to train an initial PoS-tagger model. When applied to two other texts, the resulting model attained 91.2% precision on a textual variant of the training text and 67.4% on a new, unseen text. A second model was then trained on the data from all three texts and applied to two further unseen texts, achieving a precision of 77.3% and 82.4%, respectively.
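The abstract does not state how the precision figures were computed. One common reading is token-level agreement between the tagger output and the manual (gold) annotation, and the following is a minimal sketch under that assumption. The function name `tagging_precision`, the toy tokens, and the coarse tags are invented for illustration; they are not the paper's tagset, data, or code.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, PoS tag); tags here are illustrative only


def tagging_precision(gold: List[Token], predicted: List[Token]) -> float:
    """Return the share of tokens whose predicted tag matches the gold tag.

    Assumes both sequences are aligned token by token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    if not gold:
        raise ValueError("cannot score an empty text")
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Invented toy data: one of four tokens receives the wrong tag.
gold = [("el", "PRON"), ("rei", "NOUN"), ("mandou", "VERB"), ("fazer", "VERB")]
predicted = [("el", "DET"), ("rei", "NOUN"), ("mandou", "VERB"), ("fazer", "VERB")]

score = tagging_precision(gold, predicted)
print(f"{score:.1%}")  # 3 of 4 tags match
```

Scoring lemmata the same way would only require comparing (tag, lemma) pairs instead of tags alone.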
Acknowledgements
Research for this paper was partially funded by public funds through FCT, proj. ref. UIDB/50021/2020, proj. ref. UIDP/00214/2020, and proj. ref. UI/BD/152806/2022.
Keywords
Automatic annotation, Lemmatization, Part-of-speech tagging, Old Portuguese
Fields of Science and Technology Classification
- Mathematics - Natural Sciences
- Computer and Information Sciences - Natural Sciences
Funding Records
Funding Reference | Funding Entity |
---|---|
UIDB/50021/2020 | Fundação para a Ciência e a Tecnologia |
UI/BD/152806/2022 | Fundação para a Ciência e a Tecnologia |
UIDP/00214/2020 | Fundação para a Ciência e a Tecnologia |