Imbalanced Data Classification And Its Application In The Prediction Of The Mobile Phone Replacement

Posted on: 2017-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: B Y Xiong
Full Text: PDF
GTID: 2348330533950159
Subject: Computer Science and Technology
Abstract/Summary:
Imbalanced datasets are common in the real world, and their classification is an active topic in data mining. Traditional classifiers take overall prediction accuracy as the training objective, which leads to high prediction accuracy on the majority class and low prediction accuracy on the minority class. However, in practical applications such as predicting mobile phone replacement, classification accuracy on the minority class matters more. The problem to be solved is therefore how to change the data distribution of a dataset and improve classification accuracy on the minority class while maintaining overall classification performance.

There are currently two common approaches to classifying imbalanced datasets. One is to design new learning algorithms, or modify existing ones, so that they adapt to imbalanced datasets. The other is to preprocess the dataset, weakening the imbalance of the original data by altering the distribution of the training samples. This thesis covers both algorithm improvement and data processing for imbalanced dataset classification.

1. The thesis proposes a hierarchical cost-sensitive decision tree algorithm. The algorithm uses rough sets to perform attribute reduction and to calculate attribute importance, then builds a hierarchical structure by partitioning the attributes. Finally, a cost-sensitive decision tree serves as the base classifier of the hierarchical structure; the tree is constructed with a splitting criterion that includes both the Gini index and the misclassification cost. The experiments show that this algorithm can handle the original dataset directly, which preserves the integrity of the information, and can also effectively handle the balanced dataset obtained by under-sampling, which reduces the scale of the problem. The algorithm also shows good stability within a certain degree of imbalance.

2. The thesis proposes an under-sampling method based on sample weight for the imbalance problem. In this method, a sample weight is introduced to reveal the region in which each sample is located. First, a weight is assigned to each sample according to the sample scale and is then modified after clustering the dataset. Samples in the center of the majority class have less weight. Then some samples are drawn from the majority class in accordance with the sample weight; in this procedure, samples in the center of the majority class can be selected easily. The sampled majority class samples and all the minority class samples are combined as the training dataset for a component classifier. In this way, several decision tree sub-classifiers are obtained. Finally, the prediction model is constructed based on the accuracy of each sub-classifier. The experiments show that this method makes the samples selected from the imbalanced dataset more representative. On that basis, the method improves the classification performance on the minority class and reduces the scale of the dataset.
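The weighted under-sampling ensemble described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, and every detail the abstract leaves open is an assumption here: the weight is simply the normalized distance to the majority centroid (the clustering-based adjustment is omitted), a lower weight is read as a higher chance of being drawn, each draw takes as many majority samples as there are minority samples, depth-1 decision stumps stand in for the decision tree sub-classifiers, and each sub-classifier's vote is scaled by its accuracy on the full training set. All function names are hypothetical.

```python
import math
import random


def center_weights(majority):
    """Hypothetical weight: distance to the majority centroid, scaled to [0, 1],
    so samples near the center of the majority class receive less weight."""
    dim = len(majority[0])
    centroid = [sum(x[j] for x in majority) / len(majority) for j in range(dim)]
    dists = [math.dist(x, centroid) for x in majority]
    top = max(dists) or 1.0
    return [d / top for d in dists]


def draw_majority(majority, weights, k, rng):
    """Draw k distinct majority samples (k <= len(majority)).
    Assumption: a lower weight means a higher chance of being selected."""
    scores = [1.0 / (w + 1e-6) for w in weights]
    pool = list(range(len(majority)))
    chosen = []
    for _ in range(k):
        pick = rng.choices(range(len(pool)), weights=[scores[i] for i in pool])[0]
        chosen.append(majority[pool.pop(pick)])
    return chosen


def fit_stump(X, y):
    """Depth-1 decision tree (a stand-in for a full decision tree): choose the
    feature, threshold and orientation with the highest training accuracy."""
    best = (-1.0, 0, 0.0, 1)
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for label_above in (0, 1):
                hits = sum(
                    (label_above if x[j] > t else 1 - label_above) == c
                    for x, c in zip(X, y)
                )
                acc = hits / len(y)
                if acc > best[0]:
                    best = (acc, j, t, label_above)
    _, j, t, label_above = best
    return lambda x: label_above if x[j] > t else 1 - label_above


def fit_ensemble(majority, minority, n_models=5, seed=0):
    """Train one stump per balanced draw; keep its accuracy on the full data."""
    rng = random.Random(seed)
    weights = center_weights(majority)
    X_all = majority + minority
    y_all = [0] * len(majority) + [1] * len(minority)
    models = []
    for _ in range(n_models):
        subset = draw_majority(majority, weights, len(minority), rng)
        X = subset + minority
        y = [0] * len(subset) + [1] * len(minority)
        stump = fit_stump(X, y)
        acc = sum(stump(x) == c for x, c in zip(X_all, y_all)) / len(y_all)
        models.append((acc, stump))
    return models


def predict(models, x):
    """Accuracy-weighted vote; returns 1 (minority) or 0 (majority)."""
    vote_minority = sum(acc for acc, model in models if model(x) == 1)
    total = sum(acc for acc, _ in models)
    return 1 if vote_minority * 2 > total else 0


# Toy data: a 7x7 majority grid around the origin and six minority points.
majority = [[i * 0.1, j * 0.1] for i in range(-3, 4) for j in range(-3, 4)]
minority = [[5 + 0.1 * k, 5.0] for k in range(6)]
models = fit_ensemble(majority, minority)
print(predict(models, [-1.0, -1.0]), predict(models, [5.2, 5.0]))
```

On this toy data, a query point on the majority side of every learned threshold is assigned class 0 and a point next to the minority samples is assigned class 1. Swapping `fit_stump` for a full decision tree learner, or replacing `center_weights` with a clustering-based weight, would bring the sketch closer to the procedure the abstract outlines.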
Keywords/Search Tags: imbalanced dataset, under-sampling, ensemble learning, cost-sensitive, decision tree