Research On Classification Algorithms Of Data Mining Based On Imbalanced Data Sets

Posted on: 2018-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Liu
Full Text: PDF
GTID: 2348330536480350
Subject: Control theory and control engineering
Abstract/Summary:
The 21st century is an era in which data, as a carrier, hides a great deal of useful information that can be tapped, and how to process data and extract valuable information has become a pressing problem. Classification is an important research branch of data mining and an important form of data analysis. In real life, the data with the greatest research value are often scarce, which gives rise to what are referred to as unbalanced data sets. This thesis therefore studies how to classify the minority class in unbalanced data sets effectively. The main work is as follows:

(1) To address the low classification accuracy for the minority class in unbalanced data sets, an integrated algorithm (C4.5-INB) that combines C4.5 with an improved Naive Bayes classifier is proposed. First, classification results are obtained with the improved Naive Bayes classifier, in which the majority-class probability is multiplied by a proportion coefficient. The original data are also classified with the C4.5 algorithm. Based on the two sets of classification results, the weights of the two algorithms are determined with either the equal-weight method or the optimal collocation preference method. Finally, a new classification result is obtained by the average vote method. In experiments on UCI data comparing the three algorithms, the proposed algorithm gives more accurate and more stable classification results.

(2) To address noisy data and low classification accuracy when classifying unbalanced data sets, an active learning SVM classification algorithm based on an improved SMOTE is proposed. The algorithm uses the attribution values of the minority class samples in the training sample set, together with the majority vote method, to select and control the number of synthetic minority class samples. The classification hyperplane is determined according to the distance formula. The same number of majority class samples closest to the classification hyperplane are then selected to form a balanced sample set, on which a support vector machine (SVM) is trained to obtain an optimal classifier. Active learning is then applied to the unbalanced data set with the training samples removed: the optimal classifier classifies the remaining samples iteratively until none are left. Experimental results on UCI data show that the proposed algorithm effectively reduces the influence of noise on classification and improves classification accuracy on unbalanced data sets.

(3) To address the poor classification of high-dimensional unbalanced data sets, an improved ULDP (I-ULDP) classification algorithm for high-dimensional imbalanced data is proposed. First, the algorithm divides the samples into small local block structures on the same manifold, so that each sample belongs to its own manifold, located in the same feature subspace. Second, the minimum local embedding and the maximum global variance of each manifold are constructed. The objective function is then solved for its optimal solution, which yields the low-dimensional manifold embedded in the high-dimensional space. Finally, the classification hyperplane of a support vector machine (SVM) is set using manifold distance, and training the SVM yields an accurate classification of the unbalanced data sets. Experimental results on UCI data sets show the superiority of the proposed algorithm in dimension reduction and classification for high-dimensional unbalanced data sets.
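
The combination step described in item (1) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names `scale_majority` and `weighted_vote`, the renormalization after scaling, the coefficient value 0.5, and the example probabilities are all assumptions made for illustration. The abstract states only that the majority-class probability is multiplied by a proportion coefficient and that the two classifiers' results are combined by weights.

```python
def scale_majority(probs, majority_label, coeff):
    """Multiply the majority-class probability by coeff, then renormalize.

    Renormalizing so the values sum to 1 is an assumption of this sketch;
    the abstract does not say whether the thesis does so.
    """
    scaled = dict(probs)
    scaled[majority_label] *= coeff
    total = sum(scaled.values())
    return {label: p / total for label, p in scaled.items()}


def weighted_vote(nb_probs, tree_probs, w_nb=0.5, w_tree=0.5):
    """Weighted average of two class-probability dicts; returns (label, scores).

    The default weights of 0.5 each correspond to the equal-weight case.
    """
    labels = set(nb_probs) | set(tree_probs)
    combined = {
        label: w_nb * nb_probs.get(label, 0.0) + w_tree * tree_probs.get(label, 0.0)
        for label in labels
    }
    return max(combined, key=combined.get), combined


# Hypothetical per-sample outputs of the two classifiers (illustrative values).
nb_probs = {"majority": 0.6, "minority": 0.4}
tree_probs = {"majority": 0.55, "minority": 0.45}

adjusted = scale_majority(nb_probs, "majority", coeff=0.5)
label, scores = weighted_vote(adjusted, tree_probs)
```

With these illustrative numbers, down-weighting the majority class in the Naive Bayes output moves the combined decision from the majority class to the minority class. A coefficient below 1 is what pushes borderline samples toward the minority class.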
Keywords/Search Tags: data mining, unbalanced data sets, classification, Naive Bayes algorithm, SMOTE, support vector machine, ULDP, dimension reduction