Research On Data Missing Problem Of Imbalanced Data Set

Posted on: 2017-04-16  Degree: Master  Type: Thesis
Country: China  Candidate: T T Zhang  Full Text: PDF
GTID: 2348330482986414  Subject: Computer Science and Technology
Abstract/Summary:
Imbalanced data sets are a widespread form of data in data mining. Because the number of samples differs greatly between classes, standard classification algorithms perform poorly on them. Missing data is another unavoidable problem in data mining: environmental and other factors during collection or storage can cause individual data values or entire attributes to be lost, and with them part of the information the data carries. Imbalanced data sets and data sets with missing data both make data analysis and knowledge discovery difficult, so research on them has attracted increasing attention. With the rapid development of computer technology, classification based on data mining and machine learning has become a means of fast decision-making, accurate judgment, and effective support for enterprises and organizations. Imbalanced data sets with missing data are common in computer science, bioinformatics, economics, and other application fields. In imbalanced data, people usually care most about the minority classes; in data with missing values, they are concerned about the loss of useful information. Processing such data sets is therefore especially important. This thesis first describes the problems of imbalanced data sets and missing data and summarizes the results that researchers in China and abroad have obtained on such data sets. It discusses how imbalance combined with missing data affects classification, the general processing methods, and the criteria for evaluating classifier performance. Missing data values and missing attributes are also described in detail. To make the best use of the data already present in the data set, the thesis proposes a data value imputation strategy based on density clustering and grey relational analysis.
At the same time, it applies transfer learning to handle missing attributes, using the spectral feature alignment algorithm to enhance the attributes. It combines this with a cluster-boundary sampling method based on density clustering to address the sample imbalance in the data set. A support vector machine is then used as the classification model on the data set produced by these steps. Finally, the method for processing imbalanced data sets with missing data is applied to computer-aided medical diagnosis based on data mining. Experiments on real medical data sets verify the proposed method: it achieves good classification results and can assist doctors in diagnosis.
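The abstract does not spell out the imputation procedure, so the following is only a minimal sketch of how grey relational analysis could drive value imputation, not the thesis's actual algorithm. It assumes attributes are already scaled to [0, 1], that the candidate rows are complete and come from the same density cluster as the incomplete row (the clustering step itself is omitted), and that the usual distinguishing coefficient rho = 0.5 is used. The function names `grey_relational_grades` and `impute_missing`, the parameter `k`, and the sample data are all hypothetical.

```python
from typing import List, Optional, Sequence


def grey_relational_grades(
    target: Sequence[Optional[float]],
    candidates: Sequence[Sequence[float]],
    rho: float = 0.5,
) -> List[float]:
    """Grey relational grade of each candidate row with respect to the target,
    computed only over the attributes the target actually has (non-None)."""
    observed = [j for j, v in enumerate(target) if v is not None]
    if not observed or not candidates:
        raise ValueError("need at least one observed attribute and one candidate")
    # Absolute differences between the target and every candidate row.
    deltas = [[abs(target[j] - row[j]) for j in observed] for row in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:
        # Every candidate matches the target exactly on the observed attributes.
        return [1.0] * len(candidates)
    # Grey relational coefficient per attribute, averaged into one grade per row.
    return [
        sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
        for row in deltas
    ]


def impute_missing(
    target: Sequence[Optional[float]],
    candidates: Sequence[Sequence[float]],
    k: int = 2,
    rho: float = 0.5,
) -> List[float]:
    """Fill each missing attribute with the mean of that attribute over the
    k candidates that have the highest grey relational grade (assumed rule)."""
    grades = grey_relational_grades(target, candidates, rho)
    top = sorted(range(len(candidates)), key=lambda i: grades[i], reverse=True)[:k]
    filled = list(target)
    for j, value in enumerate(target):
        if value is None:
            filled[j] = sum(candidates[i][j] for i in top) / len(top)
    return filled


# Hypothetical data: three complete rows and one row missing its third attribute.
complete_rows = [[0.10, 0.20, 0.30], [0.15, 0.25, 0.35], [0.90, 0.80, 0.70]]
incomplete_row = [0.12, 0.22, None]
print(impute_missing(incomplete_row, complete_rows, k=2))
```

In this sketch the missing third attribute is filled from the two rows most closely related to the incomplete one, so the distant third row does not influence the result. A grade-weighted mean would be an equally plausible choice.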
Keywords/Search Tags: imbalanced data set, data missing, data value imputation, transfer learning