2025-01-15
2025-01-15
2024-05-28
GARCIA, E. A. S. Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora. 2024. 82 leaves. Dissertation (Master's in Computer Science) - Instituto de Informática, Universidade Federal de Goiás, Goiânia, 2024.
http://repositorio.bc.ufg.br/tede/handle/tede/13781
This research investigates the application of Natural Language Processing (NLP) to the legal domain in Portuguese, emphasizing the importance of domain adaptation for pre-trained language models, such as RoBERTa, using specialized legal corpora. We compiled and pre-processed a Portuguese legal corpus, named LegalPT, addressing the high rate of near-duplicate documents in legal corpora and comparing it with generic web-scraped corpora. Experiments with these corpora revealed that pre-training on a combined dataset of legal and general data resulted in a more effective model for legal tasks. Our model, called RoBERTaLexPT, outperformed larger models trained solely on generic corpora, such as BERTimbau and Albertina-PT-*, as well as legal models from similar works. To evaluate these models, this Master’s dissertation proposes a legal benchmark composed of several datasets, including LeNER-Br, RRI, FGV, UlyssesNER-Br, CEIAEntidades, and CEIA-Frases.
This study contributes to the improvement of NLP solutions in the Brazilian legal context by openly providing enhanced models, a specialized corpus, and a rigorous benchmark suite.
Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
Natural language processing
Language model
Legal domain
Legal benchmark
EXACT AND EARTH SCIENCES::COMPUTER SCIENCE
Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora
Dissertation