Study On Imbalanced Data Sets Classification Method And Its Application In Telecommunication

Posted on: 2012-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Wang
Full Text: PDF
GTID: 2218330371457783
Subject: Control theory and control engineering
Abstract/Summary:
In recent years, the classification of data with an imbalanced class distribution has become a difficult task in data mining and machine learning. Many real-world data sets are imbalanced, that is, some classes have far fewer instances than others. When traditional machine learning algorithms are applied to such data, the prediction accuracy on the minority class is significantly lower than on the majority class, which leads to a significant decline in classification performance. To address the imbalance problem, and in particular the poor predictive accuracy on the minority class, this thesis proposes a new approach, AdaBoost-SVM-OBMS, which combines boosting with a new over-sampling method that uses misclassified samples to generate new samples. Building on this study of imbalanced data set classification and of telecom data sets, the thesis also investigates insolvency mining, one of the most common subjects of telecom data mining. The main research work is as follows.

1. For imbalanced data set classification, AdaBoost-SVM-OBMS is proposed. It is a new approach that combines boosting, an ensemble-based learning algorithm, with a new over-sampling method based on misclassified points. In each boosting iteration, the misclassified points are identified using a Support Vector Machine as the base classifier. These points are then used to generate new examples separately for the majority and minority classes. The new examples are added to the original training set, and the classification model is retrained. The approach was evaluated in terms of AUC, F-value, and G-mean on eight benchmark imbalanced data sets. The results indicate that the new approach achieves high prediction accuracy on both the minority and majority classes.

2. Based on the characteristics of telecom insolvency data mining and the experience of telecom experts, a new concrete strategy is proposed to solve the telecom insolvency problem. The experimental results show that this strategy makes telecom insolvency data mining feasible.
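The three evaluation measures named in the abstract (AUC, F-value, and G-mean) can be illustrated with a minimal sketch. The abstract does not give the exact formulas the thesis uses, so the code below assumes the commonly used definitions: F-value as the weighted harmonic mean of precision and recall on the minority class (with an assumed `beta` of 1), G-mean as the geometric mean of sensitivity and specificity, and AUC as the fraction of (minority, majority) pairs ranked correctly by the classifier score. The function names and the toy labels are hypothetical and are not taken from the thesis.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f_value(y_true, y_pred, positive=1, beta=1.0):
    """F-value on the minority class (beta=1 is an assumption)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (minority) and specificity (majority)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sqrt(sensitivity * specificity)


def auc(y_true, scores, positive=1):
    """AUC as the share of (minority, majority) pairs ranked correctly.

    Ties between a minority and a majority score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: 2 minority (1) and 8 majority (0) instances.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_majority = [0] * 10

print(f_value(y_true, y_pred), g_mean(y_true, y_pred))
print(f_value(y_true, always_majority), g_mean(y_true, always_majority))
```

On this toy data, a classifier that always predicts the majority class is still right on 8 of 10 instances, yet its F-value and G-mean are both zero because it never finds a minority instance. This is why such measures are preferred over plain accuracy for imbalanced data.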
Keywords/Search Tags: Telecom, Data Mining, Telecom Insolvency, Imbalanced Data Sets, Large-scale Data, Support Vector Machine, Boosting