Evaluation of the quality of speech synthesis generated by deep neural network models
Date
2023-05-26
Publisher
Universidade Federal de Goiás
Abstract
With the emergence of intelligent personal assistants, the need for high-quality conversational
interfaces has increased. While text-based chatbots are popular, the development of voice interfaces
is equally important. However, voice-based conversational models are evaluated mainly through
the Mean Opinion Score (MOS), which relies on a manual and subjective process.
In this context, this thesis aims to contribute a new methodology for evaluating voice-based
conversational interfaces, with a case study specifically conducted in Brazilian Portuguese. The
proposed methodology includes an architecture for predicting the quality of synthesized speech in
Brazilian Portuguese, correlated with MOS. To evaluate the proposed methodology, this work
included training Text-to-Speech models to create the dataset called BRSpeechMOS. Details about the
creation of this dataset are presented, along with a qualitative and quantitative analysis of it. A series
of experiments was conducted to train various architectures using the BRSpeechMOS dataset. The
architectures used are based on supervised and self-supervised learning. The results
confirm the hypothesis that models pre-trained on voice processing tasks such as speaker
verification and automatic speech recognition produce suitable acoustic representations for the task
of predicting speech quality, contributing to the advancement of the state of the art in the
development of evaluation methodologies for conversational models.
Citation
OLIVEIRA, F. S. Avaliação da qualidade da sintetização de fala gerada por modelos de redes neurais profundas. 2023. 129 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal de Goiás, Goiânia, 2023.