Data enhancement for SFT in Brazilian Portuguese: a study with language models and LLM-as-Judge evaluation
Publisher
Universidade Federal de Goiás
Abstract
The scarcity of high-quality resources for Brazilian Portuguese (pt-br) hinders the development of effective language models adapted to the language's specificities. This work investigates the impact of synthetic enhancement of conversational data, using Large Language Models (LLMs), on the Supervised Fine-Tuning (SFT) of models from the Qwen2.5 family (0.5B, 1.5B, 3B). Based on the SmolTalk dataset, two pt-br versions were generated: one by direct translation and another with responses synthetically enhanced and rewritten by the LLM Gemini 2.0 Flash. The Qwen2.5 models were trained on both datasets and compared using standardized objective benchmarks for Portuguese (ENEM, HATEBR, BLUEX, ASSIN2-RTE) and a qualitative evaluation of open-ended text generation (Alpaca-Eval-BR), with Claude 3.5 Haiku acting as LLM-as-Judge on the criteria of relevance, precision, comprehensiveness, usefulness, and coherence. The results demonstrate a significant superiority of the models trained with synthetic data in the qualitative LLM-as-Judge evaluation across all metrics. In this evaluation, the normalized average F1-Score increased significantly with synthetic data: the 1.5B model achieved 44.45 (vs 14.05 for the translated data, a ~216% gain) and the 3B model reached 57.21 (vs 16.79 for the translated data, a ~241% gain). In contrast, on the objective benchmarks, the positive impact of synthetic enhancement was less pronounced, being consistent only for the 3B-parameter model.
It is concluded that the LLM-assisted synthetic data enhancement strategy is effective in significantly raising the quality and performance of conversational language models for Brazilian Portuguese, representing a valuable approach to mitigate the scarcity of dedicated resources and advance the development of NLP technologies better adapted to the national context.
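The relative gains quoted in the abstract can be reproduced from the reported scores. The following is a minimal sketch of that arithmetic, assuming the gain is defined as (synthetic - translated) / translated; the function name `relative_gain` and the model labels are illustrative and do not come from the dissertation.

```python
def relative_gain(synthetic: float, translated: float) -> float:
    """Percentage increase of the synthetic-data score over the translated-data score."""
    return (synthetic - translated) / translated * 100.0


# Scores as reported in the abstract (LLM-as-Judge evaluation):
# (score with synthetic data, score with translated data).
scores = {
    "Qwen2.5-1.5B": (44.45, 14.05),
    "Qwen2.5-3B": (57.21, 16.79),
}

for model, (synthetic, translated) in scores.items():
    print(f"{model}: {relative_gain(synthetic, translated):.0f}% gain")
```

Rounded to the nearest integer, this yields 216% for the 1.5B model and 241% for the 3B model, matching the figures stated above.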
Citation
RIOS, W. S. R. Aprimoramento de dados para SFT em português brasileiro: um estudo com modelos de língua e avaliação com LLM-as-Judge. 2025. 42 f. Dissertação (Mestrado em Ciência da Computação) - Instituto de Informática, Universidade Federal de Goiás, Goiânia, 2025.