Publication in conference proceedings
UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment
Joseph Marvin Imperial (Imperial, J.M.); Abdullah Barayan (Barayan, A.); Regina Stodden (Stodden, R.); Rodrigo Wilkens (Wilkens, R.); Ricardo Muñoz Sánchez (Muñoz Sánchez, R.); Lingyun Gao (Gao, L.); Melissa Torgbi (Torgbi, M.); Dawn Knight (Knight, D.); Gail Forey (Forey, G.); Reka R. Jablonkai (Jablonkai, R.R.); Ekaterina Kochmar (Kochmar, E.); Robert Reynolds (Reynolds, R.); Eugénio Ribeiro (Ribeiro, E.); Horacio Saggion (Saggion, H.); Elena Volodina (Volodina, E.); Sowmya Vajjala (Vajjala, S.); Thomas François (François, T.); Fernando Alva-Manchego (Alva-Manchego, F.); Harish Tayyar Madabushi (Tayyar Madabushi, H.); et al.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Year (definitive publication)
2025
Language
English
Country
China
More Information
Web of Science®

This publication is not indexed in Web of Science®

Scopus

This publication is not indexed in Scopus

Google Scholar

Times Cited: 18

(Last checked: 2026-04-28 08:11)


Overton

This publication is not indexed in Overton

Abstract
We introduce UniversalCEFR, a large-scale, multilingual, multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modeling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results support the use of linguistic features and fine-tuned pre-trained models for multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices for data distribution in language proficiency research by standardizing dataset formats and promoting their accessibility to the global research community.
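To illustrate paradigm a) from the abstract, the sketch below trains a toy nearest-centroid classifier over two surface-level linguistic features. The feature set, the (text, level) example format, and the training examples are illustrative assumptions for this sketch only, not the dataset's actual schema or the paper's feature inventory.

```python
import math

def linguistic_features(text: str) -> tuple[float, float]:
    """Toy surface features: mean word length and words per sentence."""
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    mean_word_len = sum(len(w) for w in words) / max(len(words), 1)
    words_per_sentence = len(words) / max(len(sentences), 1)
    return (mean_word_len, words_per_sentence)

# Hypothetical CEFR-labeled examples in a unified (text, level) format.
train = [
    ("I like cats. Cats are nice.", "A1"),
    ("She goes to school every day. School is fun.", "A2"),
    ("The committee deliberated extensively before announcing its verdict.", "C1"),
]

# One centroid per CEFR level, averaged over that level's feature vectors.
centroids: dict[str, tuple[float, ...]] = {}
for level in {lvl for _, lvl in train}:
    feats = [linguistic_features(t) for t, lvl in train if lvl == level]
    centroids[level] = tuple(sum(col) / len(col) for col in zip(*feats))

def predict(text: str) -> str:
    """Assign the CEFR level whose centroid is nearest in feature space."""
    f = linguistic_features(text)
    return min(centroids, key=lambda lvl: math.dist(f, centroids[lvl]))

print(predict("Dogs are good. I like dogs."))  # → A1
```

A real feature-based system would use a much richer feature set (lexical, syntactic, and morphological measures) and a trained classifier rather than a two-feature nearest-centroid rule; the sketch only shows the shape of the pipeline.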
Associated Records

This publication is associated with the following record: