Research On Imbalanced Data Sparsity Problems

Posted on: 2017-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Feng
Full Text: PDF
GTID: 2348330482486644
Subject: Software engineering
Abstract/Summary:
The sparsity problem of imbalanced data is one of the difficulties in the field of data mining. In recent years, with the continuous improvement of global information technology and the rapid development of computer technology, data mining systems have been successfully applied in medicine, telecommunications, finance, industrial production, and other fields. However, sparse imbalanced data sets are widespread in real applications; such data sets are both sparse and imbalanced. At the macro level, the classes are not balanced: only a small class within a large amount of data is of interest, and it is often difficult to identify and classify accurately. At the micro level, the data set contains many missing values, generally caused by various factors in the data collection process. If missing values are not handled effectively in the preprocessing stage, they degrade the subsequent classification step, with an especially large impact on imbalanced data. Applications based on data mining use large amounts of historical data to obtain useful knowledge, and their accuracy and efficiency are greatly reduced by data sparsity and imbalance. How to better address the sparsity problem of imbalanced data during classification has therefore attracted the attention of many researchers.
This paper describes the state of research on the data sparsity problem and the imbalance problem, and discusses in depth the root causes of data sparsity and imbalance as well as the solutions currently in use. For sparse imbalanced data sets, taking into account the complexity of their form, we present a complete solution to the problems caused by imbalance and sparsity. We use a method based on sparse data clustering to impute missing values in the preprocessing stage of imbalanced data classification.
First, we propose a single-layer filling method based on sparse data clustering and a collaborative filtering algorithm to impute missing values in a single pass. Second, to address the shortcomings of the single-layer imputation method, we propose a recursive incremental clustering imputation method and verify through experiments that it improves accuracy and efficiency relative to the single-layer clustering method and traditional methods. Once the preprocessing stage has produced non-sparse data, we propose an imbalanced data classification method based on a random walk model to resolve the imbalance problem. We then run comparative experiments on the filled data sets with different imbalance ratios and evaluate the classification results using ROC curves. These experiments verify the importance of addressing the sparsity problem of imbalanced data and the effectiveness of the overall solution.
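To make the idea of clustering-based imputation concrete, the following is a minimal sketch and not the thesis's algorithm. The abstract does not give the details of the collaborative filtering step or the recursive incremental variant, so this sketch substitutes a plain k-means-style clustering whose distance uses only the observed features of each row, then fills each gap with the corresponding value of the row's cluster center. The names `masked_distance` and `cluster_impute` and the parameters `k`, `n_iter`, and `seed` are hypothetical, and the sketch assumes every column has at least one observed value.

```python
import numpy as np


def masked_distance(row, center):
    """Mean squared difference over the features that are observed in `row`."""
    observed = ~np.isnan(row)
    if not observed.any():
        return np.inf  # a fully empty row carries no distance information
    diff = row[observed] - center[observed]
    return float(np.mean(diff ** 2))


def cluster_impute(X, k=2, n_iter=10, seed=0):
    """Fill NaN entries of X with the matching value of the row's cluster center.

    Illustrative stand-in only: plain k-means-style clustering on observed
    features. Assumes each column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    rng = np.random.default_rng(seed)

    # Column means over observed entries, used to seed centers and as a fallback.
    col_means = np.nanmean(X, axis=0)

    # Initial centers: k distinct rows, with their gaps filled by column means.
    idx = rng.choice(len(X), size=k, replace=False)
    centers = np.where(missing[idx], col_means, X[idx])

    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest center, measured on observed features only.
        labels = np.array(
            [int(np.argmin([masked_distance(row, c) for c in centers])) for row in X]
        )
        # Update step: per-feature mean over the observed values of the members.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the previous center for an empty cluster
            counts = (~np.isnan(members)).sum(axis=0)
            sums = np.nansum(members, axis=0)
            centers[j] = np.where(counts > 0, sums / np.maximum(counts, 1), centers[j])

    # Imputation step: copy the center value into every missing cell.
    filled = X.copy()
    filled[missing] = centers[labels][missing]
    return filled
```

Calling `cluster_impute(X, k=2)` on an array with `np.nan` gaps returns an array of the same shape in which the observed entries are unchanged and each gap holds a value taken from its row's cluster. A recursive or incremental scheme, as the abstract describes, would presumably re-cluster after each round of filling; that refinement is not attempted here.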
Keywords/Search Tags: imbalanced data, sparsity, missing value imputation, classification