Study On Imbalanced Data Sets Classification Method And Its Application In Telecommunication

Posted on:2012-11-18

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Wang

Full Text:PDF

GTID:2218330371457783

Subject:Control theory and control engineering

Abstract/Summary:

In recent years, classifying data with an imbalanced class distribution has been a difficult task in data mining and machine learning. Imbalanced data set classification is the pattern classification of data whose class distribution is imbalanced. In many real-world problems the data sets are imbalanced, that is, some classes have far fewer instances than others. When traditional machine learning algorithms are applied to such problems, the prediction accuracy on the minority class is significantly lower than that on the majority class, which leads to a significant decline in classification performance. To address the imbalance problem, and in particular the poor predictive accuracy on the minority class, this thesis proposes a new approach, AdaBoost-SVM-OBMS, which combines boosting with a new over-sampling method that uses misclassified samples to generate new samples. Insolvency mining is one of the most common subjects of telecom data mining; building on in-depth research into imbalanced data set classification and telecom data sets, this thesis also conducts research on telecom data mining. The main research work is as follows.

1. For imbalanced data set classification, AdaBoost-SVM-OBMS is proposed. It is a new approach that combines boosting, an ensemble-based learning algorithm, with a new over-sampling method based on misclassified points. In this approach, the misclassified points are identified during each iteration, with a Support Vector Machine as the base classifier. They are then used to generate new examples separately for the majority and minority classes. The new examples are added to the original training set, and the classification model is retrained. The new approach was evaluated in terms of AUC, F-value, and G-mean on eight benchmark imbalanced data sets. The results indicate that the new approach achieves high prediction accuracy on both the minority and the majority classes.

2. Based on the characteristics of telecom insolvency data mining and the experience of telecom experts, a new concrete strategy is proposed to solve the telecom insolvency problem. The experimental results show that this strategy makes telecom data mining feasible.
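
To illustrate why the abstract evaluates with F-value and G-mean rather than plain accuracy, the following is a minimal sketch in Python. It assumes the common textbook definitions (F-value as the harmonic mean of precision and recall on the minority class, G-mean as the square root of sensitivity times specificity); the thesis may use a weighted variant. The function names and the toy labels are hypothetical and are not taken from the thesis.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN, treating `positive` as the minority class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def f_value(tp, fn, fp):
    """Harmonic mean of precision and recall (assumed unweighted, beta = 1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fn, fp, tn):
    """Geometric mean of minority-class and majority-class recall."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy data (hypothetical): 95 majority samples (0) and 5 minority samples (1).
y_true = [0] * 95 + [1] * 5

# A classifier that always predicts the majority class.
always_majority = [0] * 100
tp, fn, fp, tn = confusion_counts(y_true, always_majority)
accuracy_trivial = (tp + tn) / len(y_true)
f_trivial = f_value(tp, fn, fp)
g_trivial = g_mean(tp, fn, fp, tn)

# A classifier that finds 4 of 5 minority samples at the cost of 5 false alarms.
balanced_pred = [1] * 5 + [0] * 90 + [1] * 4 + [0]
tp, fn, fp, tn = confusion_counts(y_true, balanced_pred)
accuracy_balanced = (tp + tn) / len(y_true)
f_balanced = f_value(tp, fn, fp)
g_balanced = g_mean(tp, fn, fp, tn)
```

On this toy data the always-majority classifier has the higher accuracy, yet its F-value and G-mean are both zero, whereas the second classifier scores well on both measures. This is the behaviour that motivates reporting these metrics for imbalanced problems.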

Keywords/Search Tags:

Telecom, Data Mining, Telecom Insolvency, Imbalanced Data Sets, Large-scale Data, Support Vector Machine, Boosting

Related items

1	Study On Data Quality Assessment Techniques For Telecom Data Mining
2	Research On Support Vector Machine For Large Scale Imbalanced Data
3	Research On Classification Algorithms Of Data Mining Based On Imbalanced Data Sets
4	Research On Classification Algorithm For Imbalanced Data Sets Based On Support Vector Machines
5	Research And Application Of The Support Vector Machine On Large-scale Datas
6	Research And Applications Of Classification Algorithms In Imbalanced Data Sets
7	The Research Of Remote Fault Diagnosis Based On Imbalanced Data Mining
8	Telecom Ip-based Data Mining Techniques Of Decision Support System Design And Implementation
9	The Classification Of Imbalanced Large Data Sets Based On Map Reduce
10	Product Development Based On Large Data Mining Technology