Balanceamento de dados com base em oversampling em dados transformados (Data balancing based on oversampling of transformed data)

Date

2020-08-17

Publisher

Universidade Federal de Goiás

Abstract

Introduction: The efficiency and reliability of data analyses depend heavily on the quality of the analyzed data. Data preprocessing is the fundamental process of preparing databases so that they are cleaner, more representative, and of higher quality; data balancing is one of its steps. Data balancing matters because many classification models commonly used in industry and academic projects are designed to work with balanced data sets, and several factors that hinder classification performance are associated with data imbalance. Objective: A new approach to data balancing is proposed that combines data transformation with resampling of the transformed data. The approach maps the input variables of the original data set to new ones, which changes the position of the samples in the feature space and, consequently, the choices that SMOTE-based resampling algorithms make about the initial samples, their nearest neighbours, and where to place the generated synthetic samples. Methods: An initial implementation based on Principal Component Analysis (PCA) and SMOTE, called PCA-SMOTE, is presented. To assess the quality of the balancing performed by PCA-SMOTE, twelve test data sets were balanced with PCA-SMOTE and with three other popular data balancing methods, and the performance of three classification models trained on the balanced sets was assessed and compared. Results: Several classification models trained on data sets balanced with the proposed method achieved higher or similar performance measures compared with the same models trained on data sets balanced with the other evaluated algorithms: Borderline-SMOTE, Safe-Level-SMOTE and ADASYN. Conclusion: The satisfactory results demonstrate the potential of the proposed algorithm to improve the learning of classifiers on imbalanced data sets.
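
The general idea described in the abstract (transform the inputs with PCA, then run SMOTE-style interpolation on the transformed samples) can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the function names (`pca_transform`, `smote`, `pca_smote`), the parameters `n_components` and `k`, the restriction to a binary problem, and the choice to return the balanced data in the transformed space are all assumptions made for illustration, and the SMOTE step is the basic interpolation variant only.

```python
import numpy as np

def pca_transform(X, n_components=None):
    """Center X and project it onto its leading principal directions (via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]  # n_components=None keeps all of them
    return Xc @ components.T

def smote(X_min, n_new, k=5, rng=None):
    """Basic SMOTE: interpolate between a minority sample and one of its
    k nearest minority neighbours. Needs at least two minority samples."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                  # seed samples
    mate = neighbours[base, rng.integers(0, k, size=n_new)]  # chosen neighbours
    gap = rng.random((n_new, 1))                           # position on the segment
    return X_min[base] + gap * (X_min[mate] - X_min[base])

def pca_smote(X, y, minority_label, n_components=None, k=5, rng=None):
    """Hypothetical PCA-then-SMOTE pipeline for a binary problem.
    Returns the balanced data set in the PCA-transformed space (an assumption)."""
    Z = pca_transform(X, n_components)
    is_min = (y == minority_label)
    n_new = int((~is_min).sum() - is_min.sum())  # samples needed to reach balance
    if n_new <= 0:
        return Z, y.copy()
    Z_new = smote(Z[is_min], n_new, k=k, rng=rng)
    y_new = np.full(n_new, minority_label, dtype=y.dtype)
    return np.vstack([Z, Z_new]), np.concatenate([y, y_new])
```

One design point is worth noting. Keeping every principal component without rescaling is a pure rotation of the centered data, which preserves Euclidean distances, so nearest neighbours only change when components are dropped or rescaled. The sketch therefore exposes `n_components`; how PCA is actually configured in PCA-SMOTE is not stated in the abstract.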

Citation

MAIONE, C. Balanceamento de dados com base em oversampling em dados transformados. 2020. 135 f. Tese (Doutorado em Ciência da Computação em Rede) - Universidade Federal de Goiás, Goiânia, 2020.