Publication in conference proceedings
Exploring Metric Correlations for Legal Text Summarization Evaluation
Martim Zanatti; Ricardo Ribeiro; Helena Sofia Pinto
Proceedings of the Twentieth International Conference on Artificial Intelligence and Law
Year (definitive publication)
2025
Language
English
Country
--
More Information
Web of Science®: Times Cited: 0 (last checked: 2026-05-02 21:42)
Scopus: Times Cited: 0 (last checked: 2026-04-25 21:57)
Google Scholar: Times Cited: 0 (last checked: 2026-05-02 18:25)
This publication is not indexed in Overton

Abstract
The rapid advancements in legal text summarization have not been matched by equivalent progress in evaluation metrics capable of assessing the quality of legal summaries. Traditional evaluation approaches, such as ROUGE, remain widely used despite their inability to capture semantic fidelity. While more recent metrics focus on semantic evaluation, their applicability to legal summarization has not been thoroughly tested, and their performance is highly dependent on embedding models and computational resources, particularly for long and complex legal texts. Furthermore, the absence of publicly available datasets with expert annotations hinders the development and validation of domain-specific evaluation methods. In this paper, we address these challenges by introducing the first publicly available dataset of Portuguese legal summaries, annotated by legal experts across multiple dimensions such as Coherence and Relevance. We use this dataset to systematically evaluate several recent evaluation metrics, comparing their performance against ROUGE, the standard metric for summarization tasks. Our analysis, based on Spearman correlation with human judgments, reveals that ROUGE-2 maintains the highest correlation across almost every evaluated dimension, outperforming more recent metrics, including semantic-based approaches. These results emphasize the challenges of adapting new evaluation frameworks to the legal domain and underscore the need for further research into metrics that can better capture domain-specific requirements.
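To illustrate the evaluation protocol the abstract describes, the sketch below computes Spearman's rank correlation between automatic-metric scores and expert ratings. All numbers and names here are hypothetical placeholders, not data or code from the paper; only the correlation method itself (average-rank Spearman) matches what the abstract reports using.

```python
def average_ranks(values):
    # Assign 1-based ranks; tied values receive the average of their rank positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical ROUGE-2 scores and expert Coherence ratings (1-5) for five summaries.
rouge2 = [0.21, 0.35, 0.18, 0.40, 0.27]
coherence = [3, 4, 2, 5, 3]
print(round(spearman(rouge2, coherence), 3))  # prints 0.975
```

A high rho here would indicate that ranking summaries by the metric largely agrees with ranking them by expert judgment, which is the comparison the paper runs per dimension (Coherence, Relevance, etc.) for each candidate metric.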
Acknowledgements
--
Keywords
Legal Automatic Evaluation, Semantic Similarity, Lexical Similarity, Spearman Correlation, Expert Evaluation Dataset
Funding Records
Funding Reference | Funding Entity
10.54499/UIDB/50021/2020 | FCT
C645008882-00000055 | EU (PRR)