Classification On Imbalanced Data Based On Immune Systems

Posted on: 2017-02-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X S Ai
Full Text: PDF
GTID: 1108330488961977
Subject: Computer Science and Technology
Abstract/Summary:
With the development of cloud computing and mobile technology, the Internet has entered the big data era. Faced with the rapid expansion of multimedia information, people need effective content management and fast information search. Classification methods, which build a classifier on a training set to classify and label unseen data, are now widely used in computer vision, character recognition, voice recognition, document classification, and other fields. Classification algorithms based on labeled data have gradually matured, including Naive Bayes, logistic regression, support vector machines, decision trees, and so on. However, these algorithms depend on the size of the training set. According to learning theory, the error rate falls below a critical point only when the sample size exceeds a predetermined lower bound. At the same time, a large number of imbalanced datasets exist in real life. Because misclassifying the minority class comes at great cost (e.g., cancer has a low incidence but is important to detect), people are more concerned about minority class examples. To resolve this conflict, this dissertation provides methods for classification on imbalanced data based on the immune system. Based on theories and principles of the human immune system, it covers binary-class imbalance, multi-class imbalance, lack of density, and intra-class imbalance. The main work and contributions are as follows:
(1) Research on theories and methods that use immune centroids to improve classification performance on binary-class imbalanced datasets. In binary classification, there are more training examples of one class (namely, positive class or majority class) than of the other class (namely, negative class or minority class). Under such circumstances, standard classification algorithms tend to favor the majority class. As a result, the misclassification rate of the minority class is significantly greater than that of the majority class.
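The bias toward the majority class can be illustrated with a small sketch. It is not taken from the dissertation: the 95:5 class ratio and the always-majority baseline are assumptions chosen only to show why overall accuracy hides the minority-class error rate.

```python
# Illustrative only: a baseline that always predicts the majority class.
# The 95:5 class ratio is an assumed example, not a figure from the dissertation.

def per_class_error(y_true, y_pred):
    """Return {label: misclassification rate}, computed separately per class."""
    errors = {}
    for label in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == label]
        wrong = sum(1 for i in idx if y_pred[i] != label)
        errors[label] = wrong / len(idx)
    return errors

y_true = ["majority"] * 95 + ["minority"] * 5
y_pred = ["majority"] * 100  # classifier biased entirely toward the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
errors = per_class_error(y_true, y_pred)

print(f"overall accuracy: {accuracy:.2f}")
print(f"majority error:   {errors['majority']:.2f}")
print(f"minority error:   {errors['minority']:.2f}")
```

In this assumed setting the overall accuracy is 0.95 even though every minority example is misclassified, which is why the per-class error rate is the more informative measure for imbalanced data.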
Based on immune network theory, the Immune Centroids Over-Sampling Technique (ICOTE) is proposed. The method generates a set of immune centroids through maturation, mutation, and suppression processes to broaden the decision regions of the minority class space. The representative immune centroids are regarded as synthetic examples for resolving the imbalance problem. Because the spatial distribution of the immune centroids follows that of the minority class examples, ICOTE not only avoids the overfitting caused by duplicated copies but also makes the synthetic examples reflect the spatial distribution of the original examples.
(2) Research on theories and methods that construct multiple immune sub-networks to improve classification performance on multi-class imbalanced datasets. Learning from multi-class datasets faces many challenges, such as an increased search space, higher algorithm complexity, and overlapping boundaries. Existing solutions proposed for binary classification problems may not be directly applicable. At the same time, as there is more than one minority class, class imbalance and class space overlap become more common and prominent. When traditional learning algorithms ignore these characteristics, they tend to decrease the misclassification rate of the majority class at the expense of the minority classes. Based on immune network theory, Global Immune Centroids Over-Sampling (Global-IC) generates self antibodies of the minority class examples to balance the class distribution, so that the resulting classifier assigns each class the same weight.
(3) Research on theories and methods that use the negative selection mechanism to improve classification performance on the sparse minority class space. Compared with the majority class space, the minority class space contains few examples and its data distribution is sparse, so it contains more outliers or small disjuncts.
Based on the negative selection mechanism of the human immune system and density-based outlier detection, the Negative Selection Over-Sampling Technique (NSOTE) generates detectors of the majority examples to increase the data density of the minority class space. NSOTE learns the spatial distribution of the entire dataset, and the resulting detectors follow the density distribution of the minority class space. Because NSOTE learns from as much of the sample data as possible and generates a denser decision region in the minority class space, decision tree learning algorithms have enough information to build a classifier that predicts unseen examples more precisely.
(4) Research on theories and methods that use immune mechanisms to address inter-class and intra-class imbalance. The imbalance problem is not simply an imbalance between classes; there are also "small disjuncts" within a class. Both factors affect the prediction accuracy of the resulting classifier. Based on immune network theory and cluster analysis, shape-based oversampling (SBO) is proposed. SBO uses a clustering algorithm to identify subclusters in a class and then over-samples the small disjuncts. On the one hand, SBO reduces CURE's parameter dependency by generating representatives of the minority class examples based on the immune network; on the other hand, SBO distinguishes fake clusters and generates immunological representatives that capture the cluster structure. As the immunological representatives are not copies of the original examples, overfitting is also alleviated. After the class and cluster distributions are well balanced, the expanded dataset shares a similar spatial distribution with the original dataset. Consequently, classification algorithms do not favor the majority class, and the resulting decision tree can correctly predict unseen examples.
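The negative selection idea in contribution (3) can be sketched roughly as follows. This is a generic negative-selection over-sampler written for illustration, not the NSOTE algorithm itself: the abstract does not give its details, and the density-based outlier detection step is omitted. The function name and the parameters `self_radius`, `spread`, and `max_tries` are hypothetical, and treating the majority examples as "self" is an assumption.

```python
import math
import random

def negative_selection_oversample(majority, minority, n_new,
                                  self_radius, spread,
                                  seed=0, max_tries=10000):
    """Generate synthetic minority points in the spirit of negative selection.

    Hypothetical sketch, not the dissertation's NSOTE:
    - candidates are drawn near existing minority points (Gaussian jitter
      with standard deviation `spread`);
    - a candidate is rejected if it lies within `self_radius` of any
      majority ("self") example;
    - surviving candidates (detectors) become synthetic minority examples.
    """
    rng = random.Random(seed)
    detectors = []
    tries = 0
    while len(detectors) < n_new and tries < max_tries:
        tries += 1
        base = rng.choice(minority)
        candidate = tuple(x + rng.gauss(0.0, spread) for x in base)
        # Negative selection step: discard candidates that match "self".
        if all(math.dist(candidate, m) > self_radius for m in majority):
            detectors.append(candidate)
    return detectors

# Toy 2-D data (made up for the example): six majority points, two minority points.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (0.3, 0.2), (0.0, 0.4), (0.4, 0.0)]
minority = [(2.0, 2.0), (2.2, 1.9)]

synthetic = negative_selection_oversample(
    majority, minority, n_new=4, self_radius=0.5, spread=0.2)
balanced_minority = minority + synthetic

print(len(majority), len(balanced_minority))
```

Drawing candidates near existing minority points keeps the synthetic examples inside the minority region, while the rejection test keeps them out of the majority region. A fuller implementation would also need some way to choose `self_radius` and `spread` from the data rather than by hand.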
Keywords/Search Tags: Classification, Imbalanced Data, Immune System, Resampling, Negative Selection