Citation: VITÓRIO, Douglas Álisson Marques de Sá et al. BM25 x Vila Sésamo: avaliando modelos Sentence-BERT para Recuperação de Informação no cenário legislativo brasileiro. Linguamática, Braga, v. 17, n. 1, p. 17-33, 2025. DOI: 10.21814/lm.17.1.474. Disponível em: https://linguamatica.com/index.php/linguamatica/pt/article/view/474. Acesso em: 17 fev. 2026.

ISSN: 1647-0818

Handle: https://repositorio.bc.ufg.br//handle/ri/29813

Abstract: BERT-based models have been largely used, becoming the state of the art for many Natural Language Processing tasks and for Information Retrieval. The Sentence-BERT architecture allowed these models to be easily used for the semantic search of documents, as it generates contextual embeddings that can be compared using similarity measures. To further investigate the application of BERT-based models for Information Retrieval, this work assessed 12 publicly available Sentence-BERT models for document retrieval within the Brazilian legislative scenario. Two BM25 variants were used as baselines: Okapi BM25 and BM25L. BM25L achieved better results, considering statistical significance, even in the scenario in which the documents were not preprocessed, while only one language model, fine-tuned using Brazilian legislative data, could reach a similar performance on one of the three datasets used.

Language: English

Rights: Open Access; http://creativecommons.org/licenses/by-nc-nd/4.0/

Keywords: Information retrieval; Legislative documents; Language models; BERT; BM25

Title (pt): BM25 x Vila Sésamo: avaliando modelos Sentence-BERT para Recuperação de Informação no cenário legislativo brasileiro

Title (en): BM25 vs. Sesame Street: assessing Sentence-BERT models for Information Retrieval within the Brazilian legislative scenario

Type: Article

DOI: 10.21814/lm.17.1.474
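The abstract compares Sentence-BERT semantic search against BM25 baselines. As a minimal illustration of the kind of lexical baseline the paper uses, the sketch below implements the standard Okapi BM25 scoring formula in pure Python; the tokenized example documents and query are hypothetical, not taken from the paper's datasets, and the k1/b defaults are common conventions, not the paper's settings.

```python
import math
from collections import Counter

def bm25_okapi_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a query with Okapi BM25.

    docs is a list of token lists; returns one score per document.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    df = Counter()                         # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                    # term frequency in this document
        s = 0.0
        for q in query_terms:
            if q not in tf:
                continue
            # Standard Okapi IDF with the +1 smoothing variant
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            s += idf * tf[q] * (k1 + 1) / (
                tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

# Hypothetical legislative-style snippets (whitespace-tokenized)
docs = [
    "projeto de lei sobre transporte urbano".split(),
    "regulamento de transporte escolar municipal".split(),
    "política ambiental e saneamento básico".split(),
]
query = "transporte urbano".split()
print(bm25_okapi_scores(query, docs))
```

The document matching both query terms scores highest, while a document sharing no terms scores zero — precisely the lexical-overlap behavior that embedding-based Sentence-BERT retrieval is meant to relax.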