Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora
Date
2024-05-28
Publisher
Universidade Federal de Goiás
Abstract
This research investigates the application of Natural Language Processing (NLP) within
the legal domain for the Portuguese language, emphasizing the importance of domain
adaptation for pre-trained language models, such as RoBERTa, using specialized legal
corpora. We compiled and pre-processed a Portuguese legal corpus, named LegalPT,
addressing the challenges of high near-duplicate document rates in legal corpora and
conducting a comparison with generic web-scraped corpora. Experiments with these
corpora revealed that pre-training on a combined dataset of legal and general data
resulted in a more effective model for legal tasks. Our model, called RoBERTaLexPT,
outperformed larger models trained solely on generic corpora, such as BERTimbau
and Albertina-PT-*, and other legal models from similar works. For evaluating the
performance of these models, we propose in this Master’s dissertation a legal benchmark
composed of several datasets, including LeNER-Br, RRI, FGV, UlyssesNER-Br, CEIA-Entidades, and CEIA-Frases. This study contributes to the improvement of NLP solutions
in the Brazilian legal context by openly providing enhanced models, a specialized corpus,
and a rigorous benchmark suite.
Citation
GARCIA, E. A. S. Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora. 2024. 82 f. Dissertação (Mestrado em Ciência da Computação) - Instituto de Informática, Universidade Federal de Goiás, Goiânia, 2024.