
Research Of Ensemble Learning For High-dimensional And Imbalanced Data Classification

Posted on: 2013-11-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Yin
Full Text: PDF
GTID: 1228330392464623
Subject: Computer software and theory
Abstract/Summary:
Data mining faces challenges from many kinds of data problems, and different data characteristics increase the complexity of algorithms. Among them, classifying data that is both high-dimensional and imbalanced has been a research focus in recent years. Current approaches consider only one aspect, high dimensionality or imbalance, but a large amount of real data exhibits both characteristics, and current approaches hit performance bottlenecks when classifying such data. How to effectively classify high-dimensional and imbalanced data is therefore a pressing problem in applications.

There are two ways to classify high-dimensional and imbalanced data: preprocessing (feature selection and sampling) followed by classification, or direct classification. Preprocessed data can be handled by traditional classification algorithms, but preprocessing loses part of the feature and instance information, and the preprocessing results influence classification performance. Direct classification keeps all the data information, but the classification algorithm must handle both high dimensionality and imbalance, which increases design complexity. This thesis studies both directions. When preprocessing high-dimensional and imbalanced data, should features be selected first or should sampling come first? We compare the two orders and conclude that classification performance is better when features are selected first. Selecting features first means the feature selection algorithm itself faces the imbalance problem; to solve it, we propose BRFVS, a feature selection algorithm for imbalanced data. To address the information loss caused by preprocessing, we propose CSRF, a cost-sensitive random forest algorithm, and IEFS, a classification algorithm based on ensemble feature selection.

The detailed contributions of this thesis are as follows:

1) Comparing how the order of feature selection and sampling affects classification performance. Earlier experimental results in one specific domain (software defect detection) suggest that performance is better when features are selected before sampling, but because those studies use only software defect detection data, the conclusion does not generalize. Other studies in further domains report that the order of feature selection and sampling is not a key factor, yet they introduce noise into their experimental data, so that conclusion does not hold for noise-free settings. This thesis selects twelve UCI datasets that vary in domain, feature count, and degree of imbalance, and tests how combining filter and wrapper feature selection with sampling affects classification performance (a pipeline sketch follows this item). In contrast to the earlier findings, the average AUC over the twelve datasets is higher when features are selected first and sampling is applied afterwards. This conclusion can guide future applications.
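As a rough illustration of the two preprocessing orders compared in item 1), the sketch below builds both pipelines and scores them by cross-validated AUC. It assumes imbalanced-learn's SMOTE as the sampler and scikit-learn's ANOVA filter as the feature selector; these are stand-ins, not the exact filter, wrapper, and sampling methods used in the thesis.

```python
# Sketch: compare the "feature selection first" and "sampling first"
# preprocessing orders by cross-validated AUC on a synthetic imbalanced set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # a pipeline that accepts samplers

X, y = make_classification(n_samples=1000, n_features=200, n_informative=20,
                           weights=[0.9, 0.1], random_state=0)

fs_first = Pipeline([("select", SelectKBest(f_classif, k=20)),
                     ("sample", SMOTE(random_state=0)),
                     ("clf", RandomForestClassifier(random_state=0))])

sample_first = Pipeline([("sample", SMOTE(random_state=0)),
                         ("select", SelectKBest(f_classif, k=20)),
                         ("clf", RandomForestClassifier(random_state=0))])

for name, pipe in [("feature selection first", fs_first),
                   ("sampling first", sample_first)]:
    auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f}")
```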
2) Proposing BRFVS, a feature selection algorithm for imbalanced data. There is currently little research on feature selection for imbalanced data. EFSBS is a filter method and so does not make full use of feedback from the classification algorithm; PREE uses classification feedback but cannot handle discrete features. BRFVS handles both discrete and continuous features, benefiting from random forest variable selection. It first obtains multiple balanced datasets by oversampling and then computes feature importance measurements on each dataset with random forest variable selection. The final measurement of a feature is the weighted sum of the individual measurements, where each weight is determined by the degree of agreement between the predictions of a tree and those of the final ensemble (see the first sketch after this abstract). We compare how different values of the random forest hyperparameter k affect final classification performance; the results show that with k set to M, classification using BRFVS-selected features outperforms the other feature selection algorithms. The experiments also further validate that selecting features first is better.

3) Proposing CSRF, a cost-sensitive random forest algorithm. Although direct classification avoids the effects of preprocessing, high-dimensional classification algorithms cannot effectively classify imbalanced data, while imbalanced-data classification algorithms do not consider high dimensionality. CSRF introduces test cost and misclassification cost into the attribute split measure of the decision trees in the random forest; both costs are tied to the positive class, so CSRF shifts attention toward positive data to raise its recognition rate (see the second sketch after this abstract). Experiments compare CSRF with a plain random forest and with a random forest that uses only misclassification cost. CSRF has an advantage in AUC, especially in the recognition rate on positive data, and its classification performance is clearly higher than classifying after preprocessing.

4) Proposing IEFS, a classification algorithm for high-dimensional and imbalanced data based on ensemble feature selection. The evaluation functions of current ensemble feature selection algorithms consider only a weighted sum of diversity and accuracy; they ignore the imbalance characteristic and are therefore unsuitable for imbalanced data classification. IEFS takes the Kohavi-Wolpert variance as the diversity measure, introduces reward and penalty factors into it to increase the focus on positive data, and uses hill climbing to search the solution space, so it accounts for diversity, accuracy, and imbalance together (see the third sketch after this abstract). Experiments show that its AUC is lower than that of CSRF but clearly higher than those of C4.5 and random forest.

In summary, although selecting features first still faces the imbalance problem, whether BRFVS or another feature selection algorithm is used, preprocessing the high-dimensionality problem before handling the imbalance problem yields better classification performance. Compared with preprocessing, direct classification is better in AUC and in the recognition rate on positive data, but it costs too much time and is only suitable for offline use. IEFS performs worse than CSRF because of the limitations of its search method.
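First, a minimal sketch of the BRFVS scoring idea from item 2), assuming scikit-learn's RandomForestClassifier with its Gini-based feature_importances_ as the per-tree importance measurement and imbalanced-learn's RandomOverSampler for building balanced datasets; the thesis's own variable selection and weighting formula may differ.

```python
# Sketch of BRFVS-style feature scoring: build several balanced datasets by
# oversampling, score features per tree, and weight each tree's importances
# by how closely its predictions agree with the whole ensemble's.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler

def brfvs_scores(X, y, n_rounds=5):
    total = np.zeros(X.shape[1])
    for r in range(n_rounds):
        # 1) Balanced dataset via oversampling the minority class.
        Xb, yb = RandomOverSampler(random_state=r).fit_resample(X, y)
        rf = RandomForestClassifier(n_estimators=50, random_state=r).fit(Xb, yb)
        ensemble_pred = rf.predict(Xb)
        round_score, weight_sum = np.zeros(X.shape[1]), 0.0
        for tree in rf.estimators_:
            # 2) Weight = agreement between this tree and the ensemble.
            w = (tree.predict(Xb) == ensemble_pred).mean()
            round_score += w * tree.feature_importances_
            weight_sum += w
        # 3) Final measurement: weighted sum, averaged over rounds.
        total += round_score / weight_sum
    return total / n_rounds
```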
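Second, a minimal sketch of a CSRF-style split measure from item 3): a Gini gain in which the classes are weighted by misclassification cost and the result is discounted by the feature's test cost. The cost values and the way the two costs are combined here are illustrative assumptions, not the thesis's exact definition.

```python
# Sketch of a cost-sensitive attribute split measure: impurity gain weighted
# toward the positive class, discounted by the cost of testing the feature.
import numpy as np

def cost_weighted_gini(y, cost_pos=5.0, cost_neg=1.0):
    """Binary Gini impurity with classes weighted by misclassification cost."""
    if len(y) == 0:
        return 0.0
    w = np.where(y == 1, cost_pos, cost_neg)
    p_pos = w[y == 1].sum() / w.sum()   # cost-weighted positive fraction
    return 2.0 * p_pos * (1.0 - p_pos)

def split_score(X, y, feature, threshold, test_cost):
    """Cost-sensitive gain of splitting `feature` at `threshold`."""
    left = X[:, feature] <= threshold
    n, n_l = len(y), left.sum()
    gain = (cost_weighted_gini(y)
            - (n_l / n) * cost_weighted_gini(y[left])
            - ((n - n_l) / n) * cost_weighted_gini(y[~left]))
    # Discount by test cost so cheap, informative features are preferred.
    return gain / (1.0 + test_cost[feature])
```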
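Third, a minimal sketch of an IEFS-style evaluation function and hill-climbing search from item 4): the Kohavi-Wolpert variance serves as the diversity term, with a simple positive-class weight standing in for the thesis's reward and penalty factors. The alpha trade-off, the weight values, and the toggle-based search are all assumptions.

```python
# Sketch of IEFS-style ensemble selection: score candidate ensembles by a
# weighted sum of accuracy and positive-weighted Kohavi-Wolpert diversity,
# then hill-climb by toggling one candidate member at a time.
import numpy as np

def kw_variance(correct, pos_mask, reward=2.0, penalty=1.0):
    """correct: (n_samples, L) boolean hit matrix for the L ensemble members."""
    L = correct.shape[1]
    l = correct.sum(axis=1)                   # members correct on each sample
    per_sample = l * (L - l) / (L ** 2)       # Kohavi-Wolpert variance terms
    w = np.where(pos_mask, reward, penalty)   # emphasize positive samples
    return np.mean(w * per_sample)

def evaluate(correct, pos_mask, alpha=0.5):
    """Weighted sum of mean member accuracy and weighted diversity."""
    return alpha * correct.mean() + (1 - alpha) * kw_variance(correct, pos_mask)

def hill_climb(all_correct, pos_mask, alpha=0.5):
    """all_correct: (n_samples, n_candidates) hit matrix, one column per
    candidate classifier. Greedily toggle members while the score improves."""
    chosen = [0]
    best = evaluate(all_correct[:, chosen], pos_mask, alpha)
    improved = True
    while improved:
        improved = False
        for j in range(all_correct.shape[1]):
            trial = sorted(set(chosen) ^ {j})   # toggle candidate j in or out
            if not trial:
                continue
            score = evaluate(all_correct[:, trial], pos_mask, alpha)
            if score > best:
                chosen, best, improved = trial, score, True
    return chosen, best
```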
Keywords/Search Tags: High-Dimensional Data Classification, Imbalanced Data Classification, Random Forest, Cost-Sensitive, Ensemble Feature Selection