Junções por similaridade aproximadas em espaços vetoriais densos
Nenhuma Miniatura disponível
Data
2023-08-24
Autores
Título da Revista
ISSN da Revista
Título de Volume
Editor
Universidade Federal de Goiás
Resumo
Similarity Join is an operation that returns pairs of objects whose similarity is greater than
or equal to a specified threshold, and is essential for tasks such as cleaning, mining, and
data integration. A common approach is to use data vector representations, such as the TFIDF method, and measure the similarity between vectors using the cosine function. However, computing the similarity for all pairs of vectors can be computationally prohibitive
on large data sets. Traditional algorithms exploit the sparsity of vectors and apply filters
to reduce the comparison space. Recently, advances in natural language processing have
produced in semantically richer vectors, improving the results quality. However, these
vectors have different characteristics from those generated by traditional methods, being
dense and of high dimensionality. Preliminary experiments demonstrated that L2AP, the
best known algorithm for similarity join, is not efficient for dense vector spaces. Due to
the intrinsic characteristics of such vectors, approximate solutions based on specialized
indices are predominant for dealing with large datasets. In this context, we investigate
how to perform similarity joins using the Hierarchical Navigable Small World (HNSW),
a state-of-the-art graph-based index designed for approximate k-nearest neighbor (kNN)
queries. We explored the design space of possible solutions, ranging from top-end alternatives to HNSW to deeper integration of similarity join processing into this framework.
The experiments carried out demonstrated accelerations of up to 2.48 and 3.47 orders of
magnitude in relation to the exact method and the baseline approach, respectively, maintaining recovery rates close to 100%.
Descrição
Palavras-chave
Citação
SANTANA, Douglas R. Junções por similaridade aproximadas em espaços vetoriais densos. 2023. 101 p. Dissertação (Mestrado em Ciência da computação) - Instituto de informática, Universidade federal de Goiás, 2023.