Ciência-IUL
Publications
Detailed Publication Description
Linking theory and practice of digital libraries. Lecture Notes in Computer Science
Year (final publication)
2022
Language
English
Country
Italy
Abstract
This paper presents the challenges faced and the solutions adopted in the lemmatization and part-of-speech (PoS) tagging of a corpus of Old Portuguese texts (up to 1525), to pave the way for the automatic annotation of these medieval texts. A highly granular tagset, previously devised for Modern Portuguese, was adapted to this end. A large text (∼155 thousand words) was manually annotated for PoS and lemmata and used to train an initial PoS-tagger model. When applied to two other texts, the resulting model attained 91.2% precision on a textual variant of the same text and 67.4% on a new, unseen text. A second model was then trained on the data from these three texts and applied to two further unseen texts, on which it achieved a precision of 77.3% and 82.4%, respectively.
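The abstract does not say how the reported precision figures are computed. As a rough illustration only, the sketch below assumes a token-level score: the share of tokens whose predicted tag matches the manually assigned one. The function name `tagging_precision` and the example tags are hypothetical and are not taken from the paper or its corpus.

```python
def tagging_precision(gold, predicted):
    """Share of tokens whose predicted tag equals the manually assigned tag.

    Assumption: one tag per token, both sequences aligned token by token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical tags for a five-token sentence (illustrative, not corpus data).
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted = ["DET", "NOUN", "VERB", "DET", "ADJ"]

score = tagging_precision(gold, predicted)
print(f"{score:.1%}")  # 4 of 5 tags match
```

Under this assumption, a figure such as 67.4% would mean that roughly two out of three tokens in the unseen text received the same tag as in the manual annotation.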
Acknowledgements
Research for this paper was partially funded by public funds through FCT, proj. ref. UIDB/50021/2020, proj. ref. UIDP/00214/2020, proj. ref. UI/BD/152806/2022.
Keywords
Automatic annotation, Lemmatization, Part-of-speech tagging, Old Portuguese
Fields of Science and Technology Classification
- Mathematics - Natural Sciences
- Computer and Information Sciences - Natural Sciences
Funding Records
Funding Reference | Funding Entity |
---|---|
UIDB/50021/2020 | Fundação para a Ciência e a Tecnologia |
UI/BD/152806/2022 | Fundação para a Ciência e a Tecnologia |
UIDP/00214/2020 | Fundação para a Ciência e a Tecnologia |