
The Research Of Imbalanced Data Classification

Posted on: 2015-02-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Cao
Full Text: PDF
GTID: 1318330482455722
Subject: Computer application technology
Abstract/Summary:
With the progress of science, and particularly the rapid development of information technology, large amounts of data in many forms are produced in every domain. However, the useful information hidden in these data remains hard to obtain. Data mining was developed to address the problem of extracting useful information from massive data. Data mining methods are now widely used in commercial, industrial, and scientific research, but one very challenging problem has drawn attention: the classification of imbalanced data. On imbalanced data, a traditional classifier that takes overall accuracy as its learning objective inevitably focuses on the majority class, which results in poor classification performance on the minority class.
This dissertation focuses on classification techniques for mining imbalanced data streams. The research builds on an in-depth exploration of strategies for imbalanced data learning, mainly re-sampling, cost-sensitive learning, and ensemble classification. It covers learning from static data as well as data streams, binary as well as multi-class classification, and the recognition of nodules in lung computer-aided detection (CAD). The contributions of this dissertation are as follows:
(1) To address the problem that within-class imbalance, class overlapping, and noisy data degrade both traditional classifiers and re-sampling, a hybrid sampling algorithm based on probability distribution estimation is proposed. The approach re-samples the data of each subclass to balance the distribution within each class, based on probability distribution estimation, and handles imbalance from the global and local perspectives simultaneously, so that the re-sampled data fit the true data distribution more closely. Class overlapping and noise in imbalanced data are considered at the same time. Experimental results demonstrate that the algorithm effectively improves the quality of the data distribution for imbalanced data and improves the overall classification performance of common classifiers.
(2) To address the problem that, on imbalanced data, the probability output that generative models produce for each instance is not consistent with its degree of class membership, an ensemble classifier combined with optimization of the decision criterion parameters is proposed. Using an imbalanced-data evaluation metric as the objective function, the method optimizes the decision criteria based on the posterior probabilities. Moreover, to improve generalization on imbalanced data, an adaptive random subspace ensemble classifier is designed, which increases the diversity among base classifiers while avoiding overfitting during learning and optimization. It can also determine the optimal number of classifiers automatically. Experiments on a large number of UCI datasets demonstrate that the proposed method has an advantage for imbalanced data learning in terms of accuracy and efficiency.
(3) To address the problem that existing re-sampling and cost-sensitive learning methods both lack effective guidance and optimization, a measure-oriented optimization wrapper algorithm for learning from imbalanced data is proposed. Taking the evaluation measure as the objective function, it applies particle swarm optimization (PSO) to optimize the imbalanced-learning factors and the feature subset simultaneously, and thereby obtains the optimal data distribution or cost-sensitive learning model for binary or multi-class problems. The wrapper algorithm is tested comprehensively on a large number of UCI datasets against state-of-the-art methods. Experimental results show that the proposed wrapper method has substantial advantages over the other methods.
(4) To improve classification of imbalanced data streams with concept drift, a weighted ensemble classification method combined with a selective over-sampling technique is introduced. The algorithm increases the number of minority-class instances and expands the minority-class decision region by selecting similar previous instances and synthesizing new borderline instances. Meanwhile, to adapt to concept drift in the data stream, an ensemble learning method with a weighting strategy based on the relevance of probability distributions is proposed, so as to improve overall classification performance. Experimental results show that the proposed algorithm effectively improves both minority-class accuracy and overall performance, demonstrating its advantage for imbalanced data stream learning.
(5) To address the problem that the nodule candidate data generated by initial detection in CAD are imbalanced between true and false positives, contain few positive instances, and are high-dimensional, the algorithms proposed in Chapters 3-5 are improved and extended based on SVM and applied to the nodule candidate data set. Experiments show that the methods eliminate as many false positives as possible while maintaining high sensitivity, and can therefore identify 3-D lung nodules accurately.
In conclusion, the methods proposed in this thesis are effective in solving the issues in imbalanced data classification, and they improve minority-class accuracy and robustness.
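The core difficulty described in the abstract, and the idea in contribution (2) of tuning the decision criterion against an imbalance-aware metric, can be illustrated with a small sketch. This is not the dissertation's algorithm: the data are made up, G-mean is an assumed choice of metric (the abstract does not name one), and a plain search over candidate thresholds stands in for whatever optimizer the author used. All function names are hypothetical.

```python
import math

def predict(scores, threshold):
    """Label an instance as minority (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

def accuracy(y_true, y_pred):
    """Fraction of instances labeled correctly (dominated by the majority class)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

def best_threshold(y_true, scores, metric):
    """Try each observed score as a threshold and keep the one maximizing `metric`."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: metric(y_true, predict(scores, t)))

# Made-up example: 18 majority instances (0) and 2 minority instances (1),
# with hypothetical posterior scores that never reach 0.5 for the minority class.
y_true = [0] * 18 + [1] * 2
scores = [0.10] * 16 + [0.35, 0.45] + [0.30, 0.40]

default_pred = predict(scores, 0.5)
tuned = best_threshold(y_true, scores, g_mean)
tuned_pred = predict(scores, tuned)

print("threshold 0.5:", accuracy(y_true, default_pred), g_mean(y_true, default_pred))
print("tuned threshold", tuned, ":", accuracy(y_true, tuned_pred), g_mean(y_true, tuned_pred))
```

On this toy data the default threshold and the tuned threshold give the same overall accuracy, yet the default one never detects a minority instance. That is why an accuracy objective gives no guidance here, while a metric that accounts for both classes does.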
Keywords/Search Tags:Data mining, Classification algorithm, Imbalanced data, Re-sampling, Cost sensitive learning, Ensemble classification, Lung nodule computer aided detection