
Research And Application On Imbalanced Data Set Based On Ensemble Learning Classification

Posted on: 2016-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: J L Huang
Full Text: PDF
GTID: 2298330467488366
Subject: Computer application technology
Abstract/Summary:
An imbalanced data set is one in which the samples of one class make up most of the data. Samples of this majority class are called negative samples, and samples of the class that accounts for only a small part of the data are called positive samples. Imbalanced data sets arise in many fields, such as economics, biology, and medicine, and they are a focus of research in data mining and machine learning. Traditional classification algorithms in these fields assume a balanced class distribution and pursue overall classification accuracy, so their accuracy declines, and they may even fail, when they classify samples from an imbalanced data set. How to improve both the overall classification accuracy on an imbalanced data set and the classification accuracy on its positive samples has therefore become a research hotspot in data mining.

This thesis first introduces the concept of the imbalanced data set, the factors that influence classification accuracy on such data, the algorithms used to solve the imbalanced classification problem, the evaluation criteria for classifier performance, and the theory and basic algorithms of ensemble learning, with an in-depth study of bagging, a common ensemble learning model. Two problems degrade the classification model: the large number of negative samples swamps the positive samples, and positive and negative samples overlap (data aliasing). To address them, the thesis presents a sample pruning strategy based on a KNN dynamic threshold. To handle missing attribute values in imbalanced data sets, the thesis proposes a KNN-based method for filling in the missing data.
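The abstract does not describe how the KNN-based filling works, so the following is only a minimal sketch of one common form of the idea, not the thesis's method. It assumes numeric attributes, marks a missing value with `None`, measures distance on the attributes two rows both have, and fills a gap with the mean over the k nearest rows. The function name `knn_impute` and the toy data are hypothetical.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of that attribute over the k nearest rows.

    Assumption: distance is the root mean squared difference over the
    attributes that both rows have observed.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no attribute in common, so the rows are not comparable
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]  # copy so the input is left unchanged
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Only rows that have attribute j observed can donate a value.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates = [c for c in candidates if c[0] != math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

# Toy data: the second row is missing its third attribute.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.9],
    [8.0, 9.0, 10.0],
]
result = knn_impute(data, k=2)
```

With `k=2`, the gap in the second row is filled from the first and third rows, which are its closest neighbours, while the distant fourth row is ignored.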
In the experimental implementation and analysis section, representative imbalanced data sets are used to verify the validity and feasibility of the proposed methods. Finally, in the field of credit card fraud detection, the strategies that achieved good classification results are integrated into an application system, which addresses the imbalanced classification problem at both the data level and the algorithm level, using real data together with the random forest and bagging ensemble learning methods.
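To make the bagging idea concrete, here is a small illustrative sketch. It is not the system described in the thesis: it assumes one-dimensional features, a 1-nearest-neighbour base learner, and class-balanced bootstrap samples, which is one common way to combine a data-level fix with an ensemble. The names `nearest_label` and `balanced_bagging_predict` and the toy data are hypothetical.

```python
import random
from collections import Counter

def nearest_label(train, x):
    """1-nearest-neighbour base learner on one-dimensional features."""
    return min(train, key=lambda item: abs(item[0] - x))[1]

def balanced_bagging_predict(samples, x, n_estimators=11, seed=0):
    """Majority vote over base learners, each fit on a class-balanced bootstrap."""
    rng = random.Random(seed)
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    votes = Counter()
    for _ in range(n_estimators):
        # Draw as many negatives as positives (with replacement) so the
        # majority class cannot swamp the minority class inside a bag.
        bag = (rng.choices(positives, k=len(positives))
               + rng.choices(negatives, k=len(positives)))
        votes[nearest_label(bag, x)] += 1
    return votes.most_common(1)[0][0]

# Toy data: 20 negatives between 0.0 and 1.9, 3 positives near 10
# (label 1 is the minority, i.e. positive, class).
samples = [(i / 10.0, 0) for i in range(20)] + [(9.5, 1), (10.0, 1), (10.5, 1)]
pred_pos = balanced_bagging_predict(samples, 9.8)
pred_neg = balanced_bagging_predict(samples, 0.4)
```

Because every bag contains both classes and the two clusters are far apart, a query near 10 is voted positive and a query near 0 is voted negative, whichever bootstrap samples happen to be drawn.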
Keywords/Search Tags: imbalanced data, KNN, ensemble learning, bagging