Publication in conference proceedings
Automatic truecasing of video subtitles using BERT: a multilingual adaptable approach
Ricardo Rei; Nuno Miguel Guerreiro; Fernando Batista
Information Processing and Management of Uncertainty in Knowledge-Based Systems
Year (definitive publication)
2020
Language
English
Country
Portugal
More Information
Web of Science®

This publication is not indexed in Web of Science®

Scopus

Times Cited: 6

(Last checked: 2024-08-20 20:00)

Google Scholar

Times Cited: 11

(Last checked: 2024-08-23 06:30)

Abstract
This paper describes an approach for automatic capitalization of text without case information, such as spoken transcripts of video subtitles produced by automatic speech recognition systems. Our approach is based on pre-trained contextualized word embeddings, requires only a small portion of data for training when compared with traditional approaches, and is able to achieve state-of-the-art results. The paper reports experiments both on general written data from the European Parliament and on video subtitles, revealing that the proposed approach is suitable for performing capitalization not only in each of the domains but also in a cross-domain scenario. We have also created a versatile multilingual model, and the conducted experiments show that good results can be achieved for both monolingual and multilingual data. Finally, we applied domain adaptation by fine-tuning models, initially trained on general written data, on video subtitles, revealing gains over other approaches not only in performance but also in computational cost.
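As an illustration of the task the abstract describes, capitalization can be framed as tagging each lowercased token with a case label, which a model built on contextual embeddings could then predict. The sketch below shows only that data framing and the restoration step. The label names (`LOWER`, `FIRST_UPPER`, `ALL_UPPER`) and the helper functions are assumptions made for this example and are not taken from the paper; no model is included.

```python
# Illustrative sketch only: label names and helpers are hypothetical,
# not the paper's actual tag set or implementation.

def case_label(token: str) -> str:
    """Map a cased token to a hypothetical capitalization tag."""
    if len(token) > 1 and token.isupper():
        return "ALL_UPPER"
    if token[:1].isupper():
        return "FIRST_UPPER"
    return "LOWER"


def to_training_pairs(cased_text: str) -> list[tuple[str, str]]:
    """Turn cased text into (lowercased token, tag) pairs for a tagger."""
    return [(tok.lower(), case_label(tok)) for tok in cased_text.split()]


def apply_labels(tokens: list[str], labels: list[str]) -> str:
    """Restore case from tags; in practice the tags would be model predictions."""
    restored = []
    for tok, lab in zip(tokens, labels):
        if lab == "ALL_UPPER":
            restored.append(tok.upper())
        elif lab == "FIRST_UPPER":
            restored.append(tok[:1].upper() + tok[1:])
        else:
            restored.append(tok)
    return " ".join(restored)


pairs = to_training_pairs("The EU Parliament met in Lisbon")
tokens = [tok for tok, _ in pairs]
labels = [lab for _, lab in pairs]
print(pairs)
print(apply_labels(tokens, labels))
```

This three-way scheme is lossy for mixed-case words such as "iPhone", which it would tag as `LOWER`; a real system would need additional labels or a character-level step to handle them.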
Acknowledgements
This work was supported by national funds through FCT, Fundação para a Ciência e a Tecnologia, under project UIDB/50021/2020, and by PT2020 funds under the project “Unbabel Scribe: AI-Powered Video Transcription and Subtitle”, contract number 038510.
Keywords
  • Other Engineering and Technology Sciences - Engineering and Technology
  • Electrical Engineering, Electronic Engineering, Information Engineering - Engineering and Technology
  • Languages and Literature - Humanities
Funding Records
Funding Reference    Funding Entity
UIDB/50021/2020      FCT
038510               PT2020
