Exploiting parallelism in document similarity tasks with applications
Nenhuma Miniatura disponível
Data
2019-09-05
Autores
Título da Revista
ISSN da Revista
Título de Volume
Editor
Universidade Federal de Goiás
Resumo
The amount of data available continues to grow rapidly and much of it corresponds to text
expressing human language, that is unstructured in nature. One way of giving some structure
to this data is by converting the documents to a vector of features corresponding to word
frequencies (term count, tf-idf, etc) or word embeddings. This transformation allows us to
process textual data with operations such as similarity measure, similarity search,
classification, among others. However, this is only possible thanks to more sophisticated
algorithms which demand higher computational power. In this work, we exploit parallelism to
enable the use of parallel algorithms to document similarity tasks and apply some of the
results to an important application in software engineering. The similarity search for textual
data is commonly performed through a k nearest neighbor search in which pairs of document
vectors are compared and the k most similar are returned. For this task we present FaSSTkNN,
a fine-grain parallel algorithm, that applies filtering techniques based on the most
common important terms of the query document using tf-idf. The algorithm implemented on a
GPU improved the top k nearest neighbors search by up to 60x compared to a baseline, also
running on a GPU. Document similarity using tf-idf is based on a scoring scheme for words that
reflects how important a word is to a document in a collection. Recently a more sophisticated
similarity measure, called word embedding, has become popular. It creates a vector for each
word that indicates co-occurrence relationships between words in a given context, capturing
complex semantic relationships between words. In order to generate word embeddings
efficiently, we propose a fine-grain parallel algorithm that finds the k less similar or farthest
neighbor words to generate negative samples to create the embeddings. The algorithm
implemented on a multi-GPU system scaled linearly and was able to generate embeddings 13x
faster than the original multicore Word2Vec algorithm while keeping the accuracy of the
results at the same level as those produced by standard word embedding programs. Finally,
we applied our accelerated word embeddings solution to the problem of assessing the quality
of fixes in Automated Software Repair. The proposed implementation was able to deal with
large corpus, in a computationally efficient way, being a promising alternative to the
processing of million source code files needed for this task.
Descrição
Citação
AMORIM, Leonardo Afonso. Exploiting parallelism in document similarity tasks with applications. 2019. 111 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal de Goiás, Goiânia, 2019.