2018-10-01
2018-08-31
OLIVEIRA, D. J. C. Junções por similaridade com expressões complexas em ambientes distribuídos. 2018. 61 f. Dissertação (Mestrado em Ciência da Computação) - Universidade Federal de Goiás, Goiânia, 2018.
http://repositorio.bc.ufg.br/tede/handle/tede/8928
A recurrent problem that degrades the quality of information in databases is the presence of duplicates, i.e., multiple representations of the same real-world entity. Although computationally expensive, similarity operations are fundamental to identifying duplicates. Furthermore, real-world data is typically composed of several attributes, each representing a distinct type of information. Complex similarity expressions are important in this context because they allow the importance of each attribute to be taken into account in the similarity evaluation. However, because of the large volume of data in Big Data applications, it has become crucial to perform these operations in parallel and distributed processing environments. To address these problems, which are highly relevant to organizations, this work proposes a novel strategy for identifying duplicates in textual data by using similarity joins with complex expressions in a distributed environment.
application/pdf
Open Access
Similarity join
Distributed systems
Apache Spark
Big data
Similarity joins
Distributed platforms
EXACT AND EARTH SCIENCES::COMPUTER SCIENCE
Junções por similaridade com expressões complexas em ambientes distribuídos
Set similarity joins with complex expressions on distributed platforms
Dissertation