Exploiting parallelism in document similarity tasks with applications

Nenhuma Miniatura disponível

Data

2019-09-05

Título da Revista

ISSN da Revista

Título de Volume

Editor

Universidade Federal de Goiás

Resumo

The amount of data available continues to grow rapidly and much of it corresponds to text expressing human language, that is unstructured in nature. One way of giving some structure to this data is by converting the documents to a vector of features corresponding to word frequencies (term count, tf-idf, etc) or word embeddings. This transformation allows us to process textual data with operations such as similarity measure, similarity search, classification, among others. However, this is only possible thanks to more sophisticated algorithms which demand higher computational power. In this work, we exploit parallelism to enable the use of parallel algorithms to document similarity tasks and apply some of the results to an important application in software engineering. The similarity search for textual data is commonly performed through a k nearest neighbor search in which pairs of document vectors are compared and the k most similar are returned. For this task we present FaSSTkNN, a fine-grain parallel algorithm, that applies filtering techniques based on the most common important terms of the query document using tf-idf. The algorithm implemented on a GPU improved the top k nearest neighbors search by up to 60x compared to a baseline, also running on a GPU. Document similarity using tf-idf is based on a scoring scheme for words that reflects how important a word is to a document in a collection. Recently a more sophisticated similarity measure, called word embedding, has become popular. It creates a vector for each word that indicates co-occurrence relationships between words in a given context, capturing complex semantic relationships between words. In order to generate word embeddings efficiently, we propose a fine-grain parallel algorithm that finds the k less similar or farthest neighbor words to generate negative samples to create the embeddings. The algorithm implemented on a multi-GPU system scaled linearly and was able to generate embeddings 13x faster than the original multicore Word2Vec algorithm while keeping the accuracy of the results at the same level as those produced by standard word embedding programs. Finally, we applied our accelerated word embeddings solution to the problem of assessing the quality of fixes in Automated Software Repair. The proposed implementation was able to deal with large corpus, in a computationally efficient way, being a promising alternative to the processing of million source code files needed for this task.

Descrição

Citação

AMORIM, Leonardo Afonso. Exploiting parallelism in document similarity tasks with applications. 2019. 111 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal de Goiás, Goiânia, 2019.