
Research On Application Of Classification Algorithms For Imbalanced Data

Posted on: 2015-05-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Qian
Full Text: PDF
GTID: 1228330467956786
Subject: Computer Science and Technology
Abstract/Summary:
Class-imbalanced data is widespread in the real world. In some research areas, classifying the minority class samples accurately is often more important than classifying the majority classes. However, most traditional classification algorithms assume that the prior distribution of samples is uniform or that misclassification costs are equal. When these algorithms are applied to imbalanced datasets, the information carried by the minority class samples is often masked by the majority class, and the minority class error rate is much higher than that of the majority classes. Therefore, the study of imbalanced data classification has received increasing attention in related fields. Focusing on the class-imbalance problem, this dissertation proposes several novel algorithms from three angles: changing the sample distribution, ensemble learning, and improving the classification algorithm. The new algorithms are applied to benchmark UCI datasets and real data and achieve satisfactory results. The main contributions of this dissertation are as follows:

(1) For the classification of imbalanced datasets, a novel resampling ensemble algorithm is presented and applied to benchmark UCI datasets. First, resampling is used to preprocess the imbalanced data, which yields relatively balanced training sets. Second, classical algorithms such as the back propagation neural network (BPNN), k-Nearest Neighbour (kNN) and Naïve Bayes (NB) are used as base classifiers. Once all the classifiers are trained, they are combined according to the bagging strategy. To verify the effectiveness of the algorithm, the F-measure and G-mean metrics are used to evaluate classifier performance. To match the actual sample distribution and reduce information redundancy, the Synthetic Minority Over-sampling Technique (SMOTE), applied at the classification boundary, is used to synthesize minority class samples.
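As a minimal sketch of the two evaluation metrics named above, the following computes the F-measure and G-mean from confusion-matrix counts, treating the minority class as the positive class. The function name and the example counts are hypothetical and are not taken from the dissertation; the formulas are the standard definitions.

```python
import math


def f_measure_and_g_mean(tp, fn, fp, tn, beta=1.0):
    """Return (F-measure, G-mean) with the minority class as the positive class."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # minority-class recall
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # majority-class recall
    b2 = beta * beta
    denom = b2 * precision + recall
    f_measure = (1 + b2) * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    return f_measure, g_mean


# Hypothetical test set: 10 minority samples, 90 majority samples.
f, g = f_measure_and_g_mean(tp=8, fn=2, fp=9, tn=81)

# A classifier that always predicts the majority class is 90% accurate here,
# yet both metrics fall to zero because no minority sample is recognised.
f_trivial, g_trivial = f_measure_and_g_mean(tp=0, fn=10, fp=0, tn=90)
```

Both metrics depend on minority-class recall, which is why they are preferred over plain accuracy when classes are imbalanced.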
Random under-sampling with replacement is used to produce the majority sample sets. A resampling model is proposed in place of blind, random selection. It applies the Naïve Bayes algorithm to multiple binary datasets. From the analysis and the experimental results at different resampling scales, we conclude that the final resampling scale is best determined by the ratio between the min class and max class counts. Another observation is that the relationship between the ratio of the min class count to the number of attributes and the performance (F-measure) of the method follows a logistic curve. Numerical results also show that the algorithm's performance is highly related to the ratio of the minority class count to the number of attributes. When this ratio is less than 3, performance is greatly hindered. In addition, experimental results show that the ensemble algorithm improves classification performance effectively.

(2) An intelligent diagnosis model for sewage treatment state monitoring is proposed based on principal component analysis (PCA) and a bagging ensemble algorithm, taking the water quality parameters of monitoring stations as the research object. First, using factor analysis and rotation, the industrial parameters of each unit of the wastewater treatment process are quantitatively analyzed. Then, the relationship underlying the multi-sensor fusion of different sources during the sewage treatment process is obtained. Moreover, the essence of how the sewage treatment process characterizes changes in water quality is revealed. The experimental results show that PCA retains 81.65% of the information in the original parameters while reducing the original 38 attributes to 10, which overcomes the influence of noise and redundancy in the data. On the basis of PCA preprocessing, the classical BPNN, kNN and Naïve Bayes algorithms are used as comparison algorithms.
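The PCA reduction described above amounts to keeping the fewest leading components whose cumulative share of variance reaches a chosen level. A minimal sketch of that selection step follows, assuming the eigenvalues of the covariance matrix are already available. The function name and the eigenvalues are hypothetical illustrations, not the dissertation's data.

```python
def components_for_variance(eigenvalues, target):
    """Return (k, ratio): the smallest k whose top-k eigenvalues explain
    at least `target` of the total variance, and the ratio reached."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k, cumulative / total
    return len(ordered), 1.0


# Hypothetical eigenvalues for a five-attribute dataset (total variance 10).
k, ratio = components_for_variance([5.0, 3.0, 1.0, 0.5, 0.5], target=0.85)
```

In the reported experiment the same idea would keep 10 of 38 components at a retained ratio of 81.65%; the threshold that was actually used is not stated in the abstract.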
Experimental results on the monitoring state of each unit in the sewage treatment process show that the new model can dynamically predict the fault type during sewage treatment, because the classifier training time is greatly reduced. Because the recognition rate of the minority classes in real data is low, a bagging ensemble classifier based on PCA preprocessing (PCA-Bagging) is proposed. Compared with the classical BPNN, kNN and Naïve Bayes, its average recognition rate is higher by 5.30%, 19.87% and 9.27%, respectively. For the minority class C2, PCA-Bagging is higher by 20.00%, 48.57% and 5.71%, respectively; for the minority class C3, it is higher by 20.00%, 65.00% and 10.00%, respectively. The experimental results show that PCA-Bagging is superior to the classical BPNN, kNN and Naïve Bayes classifiers.

(3) A balanced support vector back propagation (BSV-BP) neural network is devised for the microbial data of activated sludge in wastewater treatment. First, by combining information entropy analysis of the microbes in activated sludge with the knowledge of domain experts, eight microorganisms are identified as the objects for evaluating activated sludge quality. Second, by combining k-means analysis of the microbes in activated sludge with the knowledge of domain experts, the quality of sludge biological activity is divided into four classes. Analysis of two years of data from a sewage treatment plant shows a serious imbalance among the quality levels of sludge biological activity. To solve this imbalance, a support vector machine (SVM) is first used to generate new balanced training sets. Then, BPNN is employed for classification. To verify the validity of the model, the area under the curve (AUC) is used to evaluate the algorithms.
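For reference, AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way for the binary case only. How the dissertation extends AUC to its four sludge-quality classes is not stated in the abstract, so the function name, scores and labels here are hypothetical.

```python
def auc(scores, labels):
    """Binary AUC by pairwise comparison; tied scores count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores: three minority (1) and four majority (0) samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0, 0]
value = auc(scores, labels)
```

Because AUC depends only on how samples are ranked, it is not inflated by a large majority class the way accuracy is.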
The simulation results show that the BSV-BP algorithm not only effectively removes information redundancy and noise, but also reduces the training time of the classifier. The AUC of BSV-BP is 6.9% higher than that of the classical BPNN. Its per-class accuracy and overall recognition rate are far better than those of the BPNN and SVM algorithms. The new algorithm improves the accuracy of activated sludge quality classification, detects sewage treatment emergencies in time, and improves the accuracy of sludge return flow and residual sludge discharge. In other words, it achieves the goal of saving energy and reducing consumption.

(4) A water quality assessment model is an effective tool for quality planning, water environmental pollution control and environmental management. An evolutionary support vector machine model is established by introducing a genetic algorithm (GA) to optimize the radial basis kernel function parameter and the error penalty factor C of the support vector machine classification algorithm. To verify the effectiveness of the model, it is applied to real data from the Songyuan and Harbin sub-regions of the Songhua River and the Gansu sub-region of the Yellow River. The results show that the evolutionary support vector machine model for water quality assessment improves classification accuracy and generalization ability compared with the classical SVM method. The experimental results show that the algorithm achieves satisfactory results.
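The GA-based tuning in contribution (4) can be illustrated with a small real-coded genetic algorithm that searches over the penalty factor C and the RBF kernel parameter. This is only a sketch under stated assumptions: the encoding (log10 of each parameter), the operators (truncation selection, arithmetic crossover, Gaussian mutation, elitism) and the fitness function are all hypothetical choices, since the abstract does not specify them. In practice the fitness would be the cross-validated accuracy of an SVM trained with the candidate parameters; a smooth toy function stands in for it here so the example runs with the standard library alone.

```python
import math
import random


def toy_fitness(log_c, log_gamma):
    """Hypothetical stand-in for cross-validated SVM accuracy.
    It peaks at C = 10 and gamma = 0.1 (log10 values 1 and -1)."""
    return math.exp(-((log_c - 1.0) ** 2 + (log_gamma + 1.0) ** 2))


def evolve(fitness, bounds=(-3.0, 3.0), pop_size=20, generations=30, seed=0):
    """Search (log10 C, log10 gamma) with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    low, high = bounds

    def clip(value):
        return max(low, min(high, value))

    population = [(rng.uniform(low, high), rng.uniform(low, high))
                  for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda ind: fitness(*ind), reverse=True)
        history.append(fitness(*ranked[0]))
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0]]              # elitism: carry the best over unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            w = rng.random()                # arithmetic crossover weight
            child = tuple(clip(w * x + (1 - w) * y + rng.gauss(0.0, 0.1))
                          for x, y in zip(a, b))  # plus Gaussian mutation
            children.append(child)
        population = children
    best = max(population, key=lambda ind: fitness(*ind))
    history.append(fitness(*best))
    return best, history


best, history = evolve(toy_fitness)
best_c, best_gamma = 10 ** best[0], 10 ** best[1]
```

Searching in log space is a common choice for these two parameters because useful values span several orders of magnitude. Elitism guarantees that the best fitness never decreases from one generation to the next.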
Keywords/Search Tags: Imbalanced data, Ensemble Learning, Activated sludge classification, Water quality assessment model