Research On Imbalanced Data Sparsity Problems

Posted on:2017-03-25

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Feng

Full Text:PDF

GTID:2348330482486644

Subject:Software engineering

Abstract/Summary:

The sparsity problem of imbalanced data is one of the difficulties in the field of data mining. In recent years, with the continuous improvement of information technology worldwide and the rapid development of computer technology, data mining systems have been applied successfully in medicine, telecommunications, finance, industrial production and other fields. However, sparse imbalanced data sets are widespread in real applications, and such data sets are both sparse and imbalanced. At the macro level, the class distribution is imbalanced: the class of interest makes up only a small fraction of a large data set, so it is often hard to identify and classify accurately. At the micro level, the data set contains many missing values, generally introduced by various factors during data collection. If these missing values are not handled effectively in the preprocessing stage, they degrade the subsequent classification step, and the effect is especially severe for imbalanced data. Data mining applications rely on large amounts of historical data to extract useful knowledge, and both their accuracy and their efficiency are greatly reduced by sparsity and imbalance. How to better address the sparsity problem of imbalanced data during classification has therefore attracted the attention of many researchers.

This thesis reviews the state of research on the data sparsity problem and the imbalance problem, and discusses in depth the root causes of sparsity and imbalance as well as the solutions currently in use. For sparse imbalanced data sets, taking the complexity of their form into account, we present a complete solution to the problems caused by imbalance and sparsity. In the preprocessing stage of imbalanced data classification, we impute missing values with a method based on sparse data clustering. First, we propose a single-layer filling method based on sparse data clustering and a collaborative filtering algorithm, which imputes the missing values in a single pass. Second, to address the shortcomings of the single-layer method, we propose a recursive incremental clustering imputation method and show experimentally that it improves accuracy and efficiency relative to both the single-layer clustering method and traditional methods. Once the preprocessing stage has produced a data set that is no longer sparse, we propose an imbalanced data classification method based on a random walk model to resolve the imbalance problem. We then run comparative experiments on the imputed data sets at different imbalance ratios and evaluate the classification results using ROC curves. The results confirm the importance of addressing the sparsity problem of imbalanced data and the effectiveness of the overall solution.
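The abstract does not give the details of the clustering-based imputation, so the following is only a minimal sketch of the general idea of single-pass, cluster-based filling, not the thesis's algorithm. It assumes numeric features, NaN as the missing-value marker, at least one observed value per column, and a plain k-means loop whose distances use only each row's observed features. Missing cells are filled with the mean of the row's cluster, a stand-in for the collaborative filtering step. The function name `cluster_impute` and its parameters are hypothetical.

```python
import numpy as np


def cluster_impute(X, k=2, n_iter=20, seed=0):
    """Fill NaNs in X with the per-feature mean of each row's cluster.

    Illustrative only: k-means on partially observed rows, with distances
    computed over each row's observed features.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)

    # Start from column means so that every row has a complete vector.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(observed, X, col_means)

    # Pick k distinct rows as the initial cluster centers.
    rng = np.random.default_rng(seed)
    centers = filled[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # Squared distance to each center, masked to observed features.
        diff = (filled[:, None, :] - centers[None, :, :]) * observed[:, None, :]
        labels = (diff ** 2).sum(axis=2).argmin(axis=1)

        # Update each non-empty cluster's center from its current members.
        for c in range(k):
            members = labels == c
            if members.any():
                centers[c] = filled[members].mean(axis=0)

        # Re-impute the missing cells from the assigned cluster's center.
        filled = np.where(observed, X, centers[labels])

    return filled, labels


if __name__ == "__main__":
    nan = np.nan
    data = [[1.0, 1.0], [1.2, nan], [0.8, 1.0],
            [10.0, 10.0], [nan, 10.2], [10.2, 9.8]]
    filled, labels = cluster_impute(data, k=2)
    print(filled)
    print(labels)
```

On the toy data, each missing cell is drawn toward the mean of the observed values in its own cluster rather than toward the global column mean, which is the motivation for clustering before filling. A recursive or incremental variant, as described in the abstract, would repeat this process on refined clusters; that extension is not shown here.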

Keywords/Search Tags:

imbalanced data, sparsity, missing values imputation, classification

Related items

1	Researches On Imputation And Classification Of Incomplete Data Based On Variables For Missing Values
2	Research On Missing Value Imputation Method Based On Mixed Information System
3	Incomplete Data Modeling And Missing Value Imputation Based On Confidence
4	Researches On The Classification Of Imbalanced Data With Missing Values
5	Studies On Missing Data Imputation
6	Imbalanced-type Incomplete Data And Missing Value Imputations Based On TS Modeling
7	Research On Data Missing Problem Of Imbalanced Data Set
8	Multiple Imputation on Missing Values in Time Series Data
9	Research Of Missing Values Imputation Method Based On Quadrant Nearest Neighbors And DFT
10	Comparative Study On Imputation Methods Of Missing Data In XGBOOST Model Under Complete Random Missing Mechanism