
Research And Applications Of Classification Algorithms In Imbalanced Data Sets

Posted on: 2009-06-10
Degree: Master
Type: Thesis
Country: China
Candidate: J W Gao
Full Text: PDF
GTID: 2178360272963571
Subject: Systems Engineering
Abstract/Summary:
Imbalanced data sets, which arise widely in practice, are data sets in which some classes contain fewer samples than others. When traditional classification algorithms are applied to such data, accuracy on the smaller class is lower than on the larger class. Yet the smaller classes are usually the focus of interest in imbalanced data sets, so traditional classification algorithms are of limited use for them. The problem has recently attracted growing attention from researchers in China and abroad, and the results have been widely applied in related domains.

In this thesis, within the framework of KAIG, classification and knowledge acquisition in imbalanced data sets are studied further from the viewpoint of information granules. The following results are obtained.

(1) The KAIG algorithm is improved by introducing the parameter Purity, which measures the overlap between information granules. Experiments show that Purity is useful both for measuring the overlap and for judging whether the remaining overlap is acceptable when it cannot be eliminated completely. It is a new measure in the KAIG algorithm. When attribute values are continuous (sequential), they are first transformed into discrete data, and sub-attributes are then used to reduce the overlap; Purity determines whether the bounds of the sub-attributes should be modified. Although the overlap between information granules is not removed entirely, its degree is reduced, which helps in extracting rules from continuous data. Experiments demonstrate that the improved KAIG algorithm approaches the performance of traditional classification algorithms on balanced data sets and outperforms them on imbalanced data sets; in particular, it outperforms the original KAIG algorithm when attribute values are continuous.

(2) The improved KAIG algorithm is used to build a telecom customer churn prediction model, since churn data form a typical imbalanced data set. Telephone data from April to July 2007, obtained from a telecom company in a city in Shanxi Province, serve as the training set from which rules are acquired. The improved KAIG algorithm is then used to predict customer churn in August 2007 and is compared with the C5.0 and Logistic algorithms. The ROC curve is introduced for the first time to measure the prediction accuracy of telecom customer churn.

This thesis has studied the classification of imbalanced data sets and the prediction of telecom customer churn. Two questions remain open: how to classify imbalanced data sets with categorical or mixed attributes effectively, and how to bring further factors, such as competitor analysis and quality of service, into the telecom customer churn prediction model. This work is only a beginning, and related work needs to be developed further.
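The point that overall accuracy hides poor performance on the smaller class, and the reason an ROC curve is a more suitable measure for churn prediction, can be illustrated with a small sketch. The Python below is not the thesis's implementation and does not use its data: the labels and scores are synthetic, and the function names (`per_class_accuracy`, `roc_points`, `auc`) are hypothetical helpers introduced only for this illustration. Class 1 stands in for the minority class (for example, churners).

```python
from typing import List, Sequence, Tuple


def per_class_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (overall, majority-class, minority-class) accuracy.

    Assumes binary labels, with 0 as the majority class and 1 as the minority class.
    """
    pairs = list(zip(y_true, y_pred))
    overall = sum(t == p for t, p in pairs) / len(pairs)
    majority = [p for t, p in pairs if t == 0]
    minority = [p for t, p in pairs if t == 1]
    acc_majority = sum(p == 0 for p in majority) / len(majority)
    acc_minority = sum(p == 1 for p in minority) / len(minority)
    return overall, acc_majority, acc_minority


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return ROC points (false positive rate, true positive rate).

    The decision threshold is swept from the highest score to the lowest;
    samples with equal scores are added together so ties yield one point.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda item: -item[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Synthetic imbalanced data: 95 majority samples, 5 minority samples.
y_true = [0] * 95 + [1] * 5

# A classifier that always predicts the majority class.
always_majority = [0] * 100
overall, acc_majority, acc_minority = per_class_accuracy(y_true, always_majority)

# The same classifier expressed as scores: every sample gets the same score.
constant_auc = auc(roc_points(y_true, [0.0] * 100))

# A scorer that ranks every minority sample above every majority sample.
perfect_auc = auc(roc_points(y_true, [0.1] * 95 + [0.9] * 5))
```

In this sketch the always-majority classifier reaches 95% overall accuracy while classifying none of the minority samples correctly, and its area under the ROC curve is 0.5, the same as random guessing. A scorer that separates the two classes perfectly reaches an area of 1.0, regardless of how imbalanced the classes are.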
Keywords/Search Tags:Imbalanced data sets, Classification, ROC curve, Information granules, KAIG, Fuzzy ART, Telecom customers churn