Research On Classification Methods For Large-scale Imbalanced Data

Posted on:2015-12-01

Degree:Master

Type:Thesis

Country:China

Candidate:P P Fu

Full Text:PDF

GTID:2298330431985570

Subject:Computer software and theory

Abstract/Summary:

With the development of the information society, acquiring and storing data has become increasingly convenient, and large-scale imbalanced data sets are emerging in everyday life. Faced with imbalanced data sets of varying scale, how to find useful information quickly, accurately, and comprehensively has become a notable challenge for information technology and business. Data mining is an advanced data analysis and processing technology that has been widely used in health care, insurance, telecommunications, finance, and other fields.

Classification is one of the key technologies in data mining and has attracted broad interest in academia. Many classification algorithms have emerged, and they are effective to some degree. As data keeps changing, however, their shortcomings are becoming increasingly prominent. In the move from traditional static datasets to today's dynamic data streams, the degree of imbalance rises as data size grows. These new characteristics pose a considerable challenge to traditional data classification methods, and the resulting problems have to be overcome. How to design an effective data classification model therefore remains a focus of current research.

To study data classification under large-scale and imbalanced data characteristics, this paper proposes two data classification models, as follows:

(1) Based on hierarchical clustering and re-sampling, this paper uses the idea of training set reduction to design a classification model for large datasets. The proposed method first uses the k-means clustering algorithm to partition the dataset into several subsets. Then, within each subset, the method clusters class by class and selects the samples in the neighborhood of each cluster center to form the final training dataset. Last, the method trains an SVM on the resulting candidate training dataset. The experimental results show that the proposed method can substantially reduce the SVM learning cost, and that the resulting SVM achieves better classification accuracy with faster training.

(2) For static class-imbalanced data, this paper proposes a scaling kernel SVM classification model based on the chi-square test. First, a standard SVM is used to obtain an approximate hyperplane; the distance of each sample from this approximate hyperplane is computed, and the initial dataset is divided accordingly. Then, following the idea of using the kernel function to correct the class distribution, a new kernel transformation method that combines conformal transformation with the chi-square test is proposed, in which the class boundaries are repeatedly amended to expand the asymmetric spatial boundaries of the classes. Finally, the SVM classification model is established once more. Experimental results show that the method can compensate for data skew and achieves higher classification accuracy.
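The training set reduction described for model (1) can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the abstract gives no parameter values, so the function name `reduce_training_set`, the parameters `n_partitions`, `clusters_per_class`, and `neighbors_per_center`, and the choice of "neighborhood" as the nearest samples by Euclidean distance are all assumptions, and scikit-learn is used only for convenience.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC


def reduce_training_set(X, y, n_partitions=4, clusters_per_class=3,
                        neighbors_per_center=10, seed=0):
    """Return sorted indices of the samples kept after two-level clustering.

    All parameter values are illustrative assumptions, not thesis settings.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Level 1: partition the whole dataset into subsets with k-means.
    part = KMeans(n_clusters=n_partitions, n_init=10,
                  random_state=seed).fit_predict(X)
    keep = []
    for p in np.unique(part):
        for label in np.unique(y[part == p]):
            idx = np.flatnonzero((part == p) & (y == label))
            k = min(clusters_per_class, len(idx))
            # Level 2: cluster a single class inside a single subset.
            km = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit(X[idx])
            for center in km.cluster_centers_:
                # Assumed neighborhood: the samples nearest to the center.
                dist = np.linalg.norm(X[idx] - center, axis=1)
                keep.extend(idx[np.argsort(dist)[:neighbors_per_center]])
    return np.unique(keep)


# Synthetic imbalanced data: 2000 majority points, 200 minority points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (2000, 2)),
               rng.normal(3.0, 1.0, (200, 2))])
y = np.array([0] * 2000 + [1] * 200)

kept = reduce_training_set(X, y)
# Final step: train the SVM on the reduced training set only.
clf = SVC(kernel="rbf").fit(X[kept], y[kept])
print(len(kept), "of", len(X), "samples kept for SVM training")
```

Because at most `n_partitions * n_classes * clusters_per_class * neighbors_per_center` samples survive, the SVM is trained on a small fraction of the data, which is where the reduction in learning cost would come from. How well accuracy holds up depends on the parameters and the data, so the sketch makes no claim about matching the reported results.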

Keywords/Search Tags:

classification, imbalanced data, support vector machines, scaling kernel SVM

Related items

1	Research On Support Vector Machine Classification Method For Imbalanced Datasets
2	Research On Outlier Detection Based On Support Vector Machines
3	Research On Methods Of Imbalanced Data Set Classification
4	Research On Classification Algorithm For Imbalanced Data Sets Based On Support Vector Machines
5	The Research Of Classification Algorithm Based On Support Vector Machine
6	An Improved Classification Algorithm Of SVM For Learning Unbalanced Datasets
7	Classification Methods Based On Support Vector Machines And Manifold Learning
8	Three Classification Algorithms Based On Nonparallel Support Vector Machines
9	The Algorithm Research And Verification Of Support Vector Machines Based On Different Kernel Functions
10	Research And Applications On Intrusion Detection Based On Support Vector Machines For Imbalanced Datasets