Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora

Date

2024-05-28

Publisher

Universidade Federal de Goiás

Abstract

This research investigates the application of Natural Language Processing (NLP) to the legal domain in Portuguese, emphasizing the importance of domain adaptation for pre-trained language models such as RoBERTa using specialized legal corpora. We compiled and pre-processed a Portuguese legal corpus, named LegalPT, addressing the high rate of near-duplicate documents typical of legal corpora, and compared it with generic web-scraped corpora. Experiments with these corpora revealed that pre-training on a combined dataset of legal and general data yields a more effective model for legal tasks. Our model, RoBERTaLexPT, outperformed larger models trained solely on generic corpora, such as BERTimbau and Albertina-PT-*, as well as legal-domain models from related work. To evaluate these models, this Master's dissertation proposes a legal benchmark composed of several datasets: LeNER-Br, RRI, FGV, UlyssesNER-Br, CEIA-Entidades, and CEIA-Frases. This study contributes to improving NLP solutions for the Brazilian legal context by openly releasing enhanced models, a specialized corpus, and a rigorous benchmark suite.

Citation

GARCIA, E. A. S. Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora. 2024. 82 f. Dissertação (Mestrado em Ciência da Computação) - Instituto de Informática, Universidade Federal de Goiás, Goiânia, 2024.