Research And Applications Of Classification Algorithms In Imbalanced Data Sets

Posted on:2009-06-10

Degree:Master

Type:Thesis

Country:China

Candidate:J W Gao

Full Text:PDF

GTID:2178360272963571

Subject:Systems Engineering

Abstract/Summary:

Imbalanced data sets, which are common in practice, are data sets in which some classes contain fewer samples than other classes. When traditional classification algorithms are applied to imbalanced data sets, the accuracy on the smaller class is lower than the accuracy on the larger class. The smaller classes, however, are usually the focus of interest in imbalanced data sets, so traditional classification algorithms are of limited use for them. Recently, researchers in China and abroad have paid increasing attention to this problem, and the results have been widely applied in related domains.

In this thesis, within the KAIG framework, classification and knowledge acquisition in imbalanced data sets are studied further from the viewpoint of information granules. The following results are obtained.

(1) The KAIG algorithm is improved by introducing the parameter Purity, which measures the overlaps between information granules. Experiments show that Purity is useful both for measuring the overlaps and for judging whether they are acceptable when they cannot be eliminated completely. It is a new measure in the KAIG algorithm. When attribute values are sequential, they are first transformed into discrete data, and sub-attributes are then used to reduce the overlaps; the parameter Purity determines whether the bounds of the sub-attributes should be modified. Although the overlaps between information granules are not removed entirely, their degree is reduced, which helps to extract rules from sequential data. Experiments demonstrate that the improved KAIG algorithm approaches the performance of traditional classification algorithms on balanced data sets and outperforms them on imbalanced data sets. In particular, it outperforms the original KAIG algorithm when attribute values are sequential.

(2) The improved KAIG algorithm is used to construct a telecom customer churn prediction model, since telecom churn data form a typical imbalanced data set. Telephone data from April to July 2007, obtained from a telecom company in a city in Shanxi Province, are used as training sets to acquire rules. The improved KAIG algorithm is then used to predict customer churn in August 2007 and is compared with the C5.0 and Logistic algorithms. The ROC curve is introduced to measure the prediction accuracy of telecom customer churn.

This thesis presents some studies on the classification of imbalanced data sets and the prediction of telecom customer churn. It remains to be studied how to classify imbalanced data sets with categorical or mixed attributes effectively, and how to bring further factors, such as competitor analysis and quality of service, into the telecom customer churn prediction model. This work is only a beginning, and related work needs to be developed further.
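The abstract's two evaluation points, that overall accuracy hides poor performance on the smaller class and that the ROC curve is a more suitable measure for churn prediction, can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names `recall_per_class` and `roc_auc`, the 95/5 class split, and the toy scores are all assumptions made for illustration, and the area under the ROC curve is computed through its rank interpretation (the probability that a randomly chosen positive sample is scored above a randomly chosen negative one).

```python
def recall_per_class(y_true, y_pred):
    """Return {label: recall} for every class that appears in y_true."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        if t == p:
            hits[t] = hits.get(t, 0) + 1
    return {c: hits.get(c, 0) / n for c, n in totals.items()}


def roc_auc(y_true, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced data: 95 staying customers (0), 5 churners (1).
y_true = [0] * 95 + [1] * 5

# A trivial model that predicts "stays" for everyone.
always_stay = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_stay)) / len(y_true)
recalls = recall_per_class(y_true, always_stay)

# Constant scores carry no ranking information about churn.
auc_constant = roc_auc(y_true, [0.5] * 100)

# A small scored example with one misordered positive/negative pair.
auc_small = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy setting the trivial model reaches 95% overall accuracy while its recall on the churn class is zero, and its AUC is 0.5, the value of an uninformative ranking. This is the kind of gap that motivates evaluating churn models with the ROC curve instead of accuracy alone.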

Keywords/Search Tags:

Imbalanced data sets, Classification, ROC curve, Information granules, KAIG, Fuzzy ART, Telecom customers churn

Related items

1	Research On Classification Of Imbalanced Telecom Customer Data
2	Study On Imbalanced Data Sets Classification Method And Its Application In Telecommunication
3	Research And Application Of Telecom Customer Churn Prediction Based On Fuzzy Bayesian Network
4	Research On Information Extraction Based On Prediction And Classification Model Of Imbalanced Data Sets
5	The Application Of Data Mining In Telecom Customers Churn Control
6	A Research On Bagging Of XGBoost Classifiers For Prediction Churn In Telecom
7	Research On The Classification Of Imbalanced Data Sets Based On R-SMOTE
8	Research On The Classification Of Imbalanced Data Sets And Related Problems
9	Text Classification Algorithm Based On Imbalanced Data Sets
10	Researches On Fuzzy Clustering Methods Based On Information Granules