Data Enhancement for SFT in Brazilian Portuguese: A Study with Language Models and LLM-as-Judge Evaluation

Publisher

Universidade Federal de Goiás

Abstract

The scarcity of high-quality resources for Brazilian Portuguese (pt-br) hinders the development of effective language models adapted to the language's specificities. This work investigates the impact of synthetic enhancement of conversational data, using Large Language Models (LLMs), on the Supervised Fine-Tuning (SFT) of models from the Qwen2.5 family (0.5B, 1.5B, 3B). Based on the SmolTalk dataset, two versions were generated for pt-br: one by direct translation and another with responses synthetically enhanced and rewritten by the LLM Gemini 2.0 Flash. The Qwen2.5 models were trained on both datasets and comparatively evaluated using standardized objective benchmarks for Portuguese (ENEM, HATEBR, BLUEX, ASSIN2-RTE) and through qualitative evaluation of open-ended text generation (Alpaca-Eval-BR), using Claude 3.5 Haiku as LLM-as-Judge based on relevance, precision, comprehensiveness, usefulness, and coherence criteria. The results demonstrate a significant superiority of the models trained with synthetic data in the qualitative LLM-as-Judge evaluation across all metrics. In this evaluation, the normalized average F1-Score significantly increased with synthetic data: the 1.5B model achieved 44.45 (vs 14.05 for the translated, a ~216% gain) and the 3B model reached 57.21 (vs 16.79 for the translated, a ~241% gain). In contrast, on the objective benchmarks, the positive impact of synthetic enhancement was less pronounced, being more consistent only in the 3B parameter version.
It is concluded that the LLM-assisted synthetic data enhancement strategy is effective in significantly raising the quality and performance of conversational language models for Brazilian Portuguese, representing a valuable approach to mitigate the scarcity of dedicated resources and advance the development of NLP technologies better adapted to the national context.
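
The percentage gains quoted in the abstract can be checked with a few lines of Python. This is only a sketch: it assumes the gain is the relative increase of the synthetic-data score over the translated-data score, (synthetic - translated) / translated, which the abstract does not state explicitly. The function name `relative_gain_pct` and the dictionary labels are illustrative, not taken from the dissertation.

```python
def relative_gain_pct(enhanced: float, baseline: float) -> float:
    """Percentage increase of `enhanced` over `baseline` (assumed definition)."""
    return (enhanced - baseline) / baseline * 100.0


# Scores quoted in the abstract: (synthetic-data model, translated-data model).
scores = {
    "Qwen2.5-1.5B": (44.45, 14.05),
    "Qwen2.5-3B": (57.21, 16.79),
}

gains = {
    name: relative_gain_pct(synthetic, translated)
    for name, (synthetic, translated) in scores.items()
}

for name, gain in gains.items():
    print(f"{name}: ~{round(gain)}% gain")
```

Under that assumption, the rounded results match the ~216% and ~241% figures reported above.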

Citation

RIOS, W. S. R. Aprimoramento de dados para SFT em português brasileiro: um estudo com modelos de língua e avaliação com LLM-as-Judge. 2025. 42 f. Dissertação (Mestrado em Ciência da Computação) - Instituto de Informática, Universidade Federal de Goiás, Goiânia, 2025.