Publication in conference proceedings
Exploring Metric Correlations for Legal Text Summarization Evaluation
Martim Zanatti; Ricardo Ribeiro; Helena Sofia Pinto
Proceedings of the Twentieth International Conference on Artificial Intelligence and Law
Year (definitive publication)
2025
Language
English
Country
--
More Information
Web of Science®: Times Cited: 0 (last checked: 2026-05-02 21:42)
Scopus: Times Cited: 0 (last checked: 2026-04-25 21:57)
Google Scholar: Times Cited: 0 (last checked: 2026-05-02 18:25)
This publication is not indexed in Overton

Abstract
The rapid advancements in legal text summarization have not been matched by equivalent progress in evaluation metrics capable of assessing the quality of legal summaries. Traditional evaluation approaches, such as ROUGE, remain widely used despite their inability to capture semantic fidelity. While more recent metrics focus on semantic evaluation, their applicability to legal summarization has not been thoroughly tested, and their performance is highly dependent on embedding models and computational resources, particularly for long and complex legal texts. Furthermore, the absence of publicly available datasets with expert annotations hinders the development and validation of domain-specific evaluation methods. In this paper, we address these challenges by introducing the first publicly available dataset of Portuguese legal summaries, annotated by legal experts across multiple dimensions such as Coherence and Relevance. We use this dataset to systematically evaluate several recent evaluation metrics, comparing their performance against ROUGE, the standard metric for summarization tasks. Our analysis, based on Spearman correlation with human judgments, reveals that ROUGE-2 maintains the highest correlation across almost every evaluated dimension, outperforming more recent metrics, including semantic-based approaches. These results emphasize the challenges of adapting new evaluation frameworks to the legal domain and underscore the need for further research into metrics that can better capture domain-specific requirements.
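To illustrate the evaluation protocol the abstract describes, the sketch below computes Spearman's rank correlation between automatic-metric scores and expert ratings. All numbers and names here are hypothetical placeholders, not data or code from the paper; only the correlation method itself (average-rank Spearman) matches what the abstract reports using.

```python
def average_ranks(values):
    # Assign 1-based ranks; tied values receive the average of their rank positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical ROUGE-2 scores and expert Coherence ratings (1-5) for five summaries.
rouge2 = [0.21, 0.35, 0.18, 0.40, 0.27]
coherence = [3, 4, 2, 5, 3]
print(round(spearman(rouge2, coherence), 3))  # prints 0.975
```

A high rho here would indicate that ranking summaries by the metric largely agrees with ranking them by expert judgment, which is the comparison the paper runs per dimension (Coherence, Relevance, etc.) for each candidate metric.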
Acknowledgements
--
Keywords
Legal Automatic Evaluation, Semantic Similarity, Lexical Similarity, Spearman Correlation, Expert Evaluation Dataset
Funding Records
Funding Reference | Funding Entity
10.54499/UIDB/50021/2020 | FCT
C645008882-00000055 | EU (PRR)