Research On Classification Methods For Large-scale Imbalanced Data

Posted on: 2015-12-01
Degree: Master
Type: Thesis
Country: China
Candidate: P P Fu
GTID: 2298330431985570
Subject: Computer software and theory
Abstract/Summary:
With the development of the information society, acquiring and storing data has become faster and more convenient, and large-scale imbalanced datasets are emerging in everyday life. Given imbalanced datasets of such varying scale, finding useful information quickly, accurately and comprehensively has become a notable challenge for information technology and business. Data mining is an advanced data analysis and processing technology that has been widely used in health care, insurance, telecommunications, finance and other fields.

Classification is one of the key technologies in data mining and has attracted broad interest in academia. Many classification algorithms have been proposed, and they are effective to some degree, but as data keeps changing their shortcomings are becoming increasingly prominent. In the shift from traditional static datasets to today's dynamic data streams, the degree of imbalance rises as the data grows in size. These new characteristics pose a serious challenge to traditional classification methods and must be addressed. How to design an effective data classification model therefore remains a focus of current research.

To study classification of data that is both large-scale and imbalanced, this thesis proposes two data classification models:

(1) Based on hierarchical clustering and re-sampling, this thesis applies the idea of training set reduction to design a classification model for large datasets. The proposed method first uses the k-means clustering algorithm to partition the dataset into several subsets. It then clusters each subset class by class and selects the samples in the neighborhood of each cluster center to form the final training dataset. Finally, it trains an SVM on the resulting candidate training dataset. The experimental results show that the proposed method substantially reduces the SVM learning cost and accelerates training while maintaining good classification accuracy.

(2) For static class-imbalanced data, this thesis proposes a scaling kernel SVM classification model based on the chi-square test. First, a standard SVM is trained to obtain an approximate hyperplane, and the dataset is initially partitioned according to each sample's distance from that hyperplane. Then, following the idea of using the kernel function to correct for the class distribution, a new kernel transformation is proposed that combines a conformal transformation with the chi-square test; it repeatedly adjusts the class boundary so that the spatial margin is enlarged asymmetrically between the classes. Finally, the SVM classification model is built once more. Experimental results show that the method compensates for data skew and achieves higher classification accuracy.
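The training set reduction described in (1) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names (`kmeans`, `reduce_training_set`), all parameter values, the pure-Python k-means, and the reading of "neighborhood of each cluster center" as the few nearest samples are hypothetical choices made for the example. The final SVM training step is left out.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest_center(p, centers) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels


def reduce_training_set(X, y, n_subsets=2, n_clusters=2, n_neighbors=3, seed=0):
    """Select a small training set (all parameter defaults are assumptions)."""
    # Step 1: partition the whole dataset into subsets with k-means.
    _, subset_of = kmeans(X, n_subsets, seed=seed)
    selected = set()
    for s in set(subset_of):
        for cls in set(y):
            # Step 2: cluster each class separately inside the subset.
            idx = [i for i in range(len(X)) if subset_of[i] == s and y[i] == cls]
            if not idx:
                continue
            centers, _ = kmeans([X[i] for i in idx], n_clusters, seed=seed)
            # Step 3: keep the samples nearest to each cluster center
            # (assumed meaning of "neighborhood" here).
            for c in centers:
                nearest = sorted(idx, key=lambda i: sq_dist(X[i], c))[:n_neighbors]
                selected.update(nearest)
    keep = sorted(selected)
    return [X[i] for i in keep], [y[i] for i in keep]


# Synthetic imbalanced data: 200 majority samples, 20 minority samples.
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
X += [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(20)]
y = [0] * 200 + [1] * 20

X_red, y_red = reduce_training_set(X, y)
print(len(X), "->", len(X_red), "training samples")
```

The reduced pair `X_red`, `y_red` would then be passed to an SVM trainer of choice. Because the clustering is done class by class, every class that occurs in the data contributes at least one sample to the reduced set.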
Keywords/Search Tags: classification, imbalanced data, support vector machines, scaling kernel SVM