Research Of Machine Learning On Imbalanced Data Sets And Its Application In Geosciences Data Processing

Posted on: 2010-10-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Gu
Full Text: PDF
GTID: 1100360275976890
Subject: Geographic Information System

Abstract/Summary:
Classification is an important task in data mining and knowledge discovery in databases. Conventional machine learning classification technologies rest on three assumptions: that maximizing overall accuracy is the goal of classification, that the classifier operates on data drawn from the same distribution as the training data, and that misclassification in any situation incurs the same cost. Based on these assumptions, a large number of classification algorithms, such as decision trees, Bayesian classification, artificial neural networks, K-nearest neighbors, support vector machines, and genetic algorithms, along with newly reported algorithms, have been developed and successfully applied in many fields, such as medical diagnosis, information retrieval, and text classification. However, these assumptions often fail on the imbalanced data sets (IDS) found in real problems, where one class is represented by a large number of samples while the other classes have very few. Most classification algorithms minimize the overall error rate while ignoring the differences in cost between types of misclassification errors, and consequently yield poor predictive accuracy on the minority class. The major difficulties of IDS classification lie in the features of the data sets themselves (lack of absolute/relative data for the minority class, data fragmentation, noise, etc.) and in the limitations of conventional classification algorithms (improper evaluation metrics and inappropriate inductive bias). Consequently, classification on IDS has become a hot topic in machine learning and pattern recognition, and it presents a great challenge for conventional classification algorithms.

In recent decades, many efforts have been made to improve classification performance on the minority class. Two general approaches are currently available for tackling imbalanced classification problems. One operates at the data level and is known as data set reconstruction or re-sampling: by under-sampling the majority class, over-sampling the minority class, or combining both techniques to reduce the degree of class distribution imbalance, classification performance on the minority class can be improved to a certain extent (a toy sketch of plain random re-sampling appears at the end of this introduction). The other operates at the algorithm level and aims to modify existing data mining algorithms or develop new ones, such as cost-sensitive learning (CSL), support vector machines, one-class classification, and ensemble learning methods. By revising the cost factor, setting different weights for specific samples, changing the probability density function, or adjusting the decision border, one can likewise improve classification performance on the minority class. Although improvements have been achieved, problems such as the loss of important information from the majority class and over-fitting when dealing with IDS remain unsolved, and they decrease the reliability of the predicted results. Therefore, how to improve performance on the minority class while preserving overall classification performance, and thus obtain accurate predictions from the classification results, is still a topic well worth further study.

Centering on this topic and starting from the three basic assumptions above, this thesis presents a deep and systematic investigation into the development of several novel algorithms for IDS and the validation of their reliability.
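For illustration only, the following is a minimal sketch of the two data-level strategies mentioned above, using plain NumPy random re-sampling on hypothetical toy data; this naive form is the baseline that the improved hybrid methods developed in this thesis refine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced data set: 950 majority samples, 50 minority samples.
X_maj = rng.normal(0.0, 1.0, size=(950, 2))
X_min = rng.normal(2.0, 1.0, size=(50, 2))

# Under-sampling: randomly discard majority samples down to the minority size.
keep = rng.choice(len(X_maj), size=len(X_min), replace=False)
X_maj_under = X_maj[keep]

# Over-sampling: randomly replicate minority samples up to the majority size.
dup = rng.choice(len(X_min), size=len(X_maj), replace=True)
X_min_over = X_min[dup]

print(X_maj_under.shape, X_min_over.shape)  # (50, 2) (950, 2)
```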
As a first step, the assessment methods and evaluation measures of classification performance were thoroughly discussed. Then, at the data level, we proposed two vital improvements to the re-sampling of IDS based on the existing SMOTE over-sampling algorithm, and applied these techniques to the preprocessing of geosciences data sets to validate their reliability; at the algorithm level, we combined the re-sampling technique with a CSL technique based on the minimal total misclassification cost to achieve better classification performance. The main efforts and conclusions of this thesis are listed below:

1. Classification Performance Evaluation and Algorithm Development for IDS

a) Assessment methods and evaluation measures of the classification performance of IDS

We first discussed whether a high overall accuracy can serve as the evaluation measure for IDS classification. Assessment methods and evaluation measures of classification performance play a critical role in guiding the design of classifiers. There are many assessment methods and evaluation measures, each with its own advantages and disadvantages, so modifying a classification algorithm is, to some extent, equivalent to improving the criterion it optimizes. Many efforts have been devoted to designing more advanced algorithms for classification problems; in fact, the assessment methods and evaluation measures are at least as important as the algorithms themselves, and they constitute the first key stage of successful data mining. We systematically summarized the typical classification technologies, the general classification algorithms, and the assessment methods and evaluation measures for IDS. Several different types of performance measures, such as numerical measures and visual measures of classifier performance, were analyzed and compared. The problems these technologies and measures exhibit on IDS may lead to misunderstanding of classification results and even to wrong strategic decisions. Besides that, a series of more complex numerical evaluation measures that can also serve for evaluating the classification performance of IDS were investigated. In general, there is no universal evaluation measure for all kinds of classification problems; a good strategy for identifying a proper evaluation measure should depend largely on the specific application requirements. Choosing an evaluation measure appropriate to the application background can help people judge the classification performance of an algorithm correctly.
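As a minimal sketch of the imbalance-aware measures discussed above, the following computes per-class precision and recall, the F-measure, and the G-mean from a binary confusion matrix; the counts are hypothetical, chosen to show how overall accuracy can mislead on IDS.

```python
import math

tp, fn = 15, 5      # minority (positive) class: true positives, false negatives
fp, tn = 10, 970    # majority (negative) class: false positives, true negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)      # dominated by the majority class
precision = tp / (tp + fp)                      # minority-class precision
recall = tp / (tp + fn)                         # minority-class recall (sensitivity)
specificity = tn / (tn + fp)                    # majority-class recall
f_measure = 2 * precision * recall / (precision + recall)
g_mean = math.sqrt(recall * specificity)        # balances performance on both classes

print(f"accuracy={accuracy:.3f} F={f_measure:.3f} G-mean={g_mean:.3f}")
# accuracy is 0.985 even though minority recall is only 0.75 and the
# F-measure is 0.667 -- exactly why overall accuracy is a poor measure for IDS.
```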
b) Re-sampling algorithms for IDS

We proposed two new hybrid re-sampling techniques based on an improved SMOTE over-sampling algorithm. By combining the over-sampling technology and the under-sampling technology, the IDS are brought to balance before classification.

The first technique is a hybrid re-sampling method with automated, adaptive selection of the number of nearest neighbors. In the original SMOTE method, new synthetic minority class examples are blindly added by randomly interpolating pairs of closest neighbors, and data sets with nominal features cannot be handled. In our over-sampling procedure, these two problems were solved by the automated adaptive selection of nearest neighbors and by adjusting the neighbor selection strategy; as a consequence, the quality of the new samples can be well controlled. In the under-sampling procedure, an improved version of the neighborhood cleaning rule (NCR) removes borderline majority class examples and noisy or redundant data. This method in fact combines the improved SMOTE with the NCR data cleaning method. The main motivation behind it is not only to balance the training data, but also to remove noisy examples lying on the wrong side of the decision border. The removal of noisy examples might aid in finding better-defined class clusters, allowing the creation of simpler models with better generalization capabilities, and therefore promises effective processing of IDS and considerably enhanced classifier performance.

The second technique is an Isomap-based hybrid re-sampling method, which attempts to reduce the degree of class distribution imbalance by combining the Isomap nonlinear dimensionality reduction method with the hybrid re-sampling technology. We first analyzed the most common linear (principal component analysis and multidimensional scaling) and nonlinear (isometric feature mapping and locally linear embedding) dimensionality reduction algorithms. These technologies were used to preprocess geosciences data and to reduce the dimensionality of the feature space; the structure of the classification model was thus simplified and the overall classification performance was greatly improved. SMOTE over-samples the minority class, but it rests on the strict assumption that the local space between any two minority class instances belongs to the minority class, which may not hold when the training data are not linearly separable. We therefore present a new re-sampling technique based on Isomap. The Isomap algorithm is first applied to map the high-dimensional data into a low-dimensional space, where the input data are more separable and can thus be over-sampled by SMOTE. The over-sampled data are then under-sampled with the NCR method, yielding balanced low-dimensional data sets. With this procedure, the evaluation measures improve step by step and the classification performance improves considerably, especially the F-measure of the minority class; in fact, both the overall and the minority class classification performance improve simultaneously. The underlying re-sampling algorithm is implemented by incorporating the Isomap technique into the hybrid SMOTE and NCR algorithm. Experimental results demonstrate that the Isomap-based hybrid re-sampling algorithm attains performance superior to that of plain re-sampling. Clearly, Isomap is an effective means of reducing the dimensionality before re-sampling, which offers a new possible solution for IDS classification.
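A minimal sketch of the Isomap-based hybrid re-sampling idea is given below, using scikit-learn's Isomap together with the standard SMOTE and NCR implementations from the imbalanced-learn library; the thesis' improvements to SMOTE (adaptive neighbor selection and nominal-feature handling) are not reproduced here, and the data set is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.manifold import Isomap
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NeighbourhoodCleaningRule

# Hypothetical imbalanced data set with roughly a 5% minority class.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# Step 1: map the high-dimensional data into a low-dimensional space,
# where the classes are expected to be more separable.
X_low = Isomap(n_components=3, n_neighbors=10).fit_transform(X)

# Step 2: over-sample the minority class with SMOTE in the low-dimensional space.
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_low, y)

# Step 3: clean borderline and noisy majority examples with the
# neighborhood cleaning rule (NCR).
X_bal, y_bal = NeighbourhoodCleaningRule().fit_resample(X_smote, y_smote)

print(np.bincount(y), "->", np.bincount(y_bal))
```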
c) CSL algorithm for IDS

We first discussed the misclassification cost problem, centering on the third assumption of conventional machine learning. Most studies have focused either on IDS classification or on cost-sensitive learning systems themselves, neglecting the fact that imbalanced class distributions and unequal misclassification costs almost always occur together. We therefore combined the re-sampling and CSL techniques to address the misclassification of IDS. On the one hand, the re-sampling technique produces balanced data sets by reconstructing both the majority and the minority class; on the other hand, classification is performed on the basis of minimal misclassification cost rather than maximal accuracy, with the misclassification cost for the minority class set much higher than that for the majority class (a minimal sketch of this decision rule is given after the contributions list below). A cost-sensitive learning procedure was then conducted for classification. Using an appropriate cost factor and balancing the data sets through re-sampling, our CSL algorithm based on the minimal misclassification cost performs much better than currently available classification techniques: not only is the classification performance on the minority class improved significantly, but the overall classification performance is also enhanced to a certain extent.

2. Application and Analysis of Our IDS Classification Algorithms in Geosciences

The re-sampling method with automated, adaptive selection of the number of nearest neighbors was applied to the hazard prediction of rockburst in engineering. Statistical data on rockbursts form a typical IDS, for which it is very difficult to give accurate predictions using conventional classification methods; in practice we are mostly concerned with the minority class rather than the majority class, and high prediction accuracy is always desired. In this thesis, the VCR rockburst database provided by the Academy of South Africa was employed as a sample IDS for classification and prediction. By adding extra artificial minority class samples to form an expanded training set, experimental simulations were performed, which yielded prediction results exactly consistent with the actual situation. Promisingly, the re-sampling method and classification scheme we developed are feasible and reasonable for engineering IDS applications. Our algorithms require no complicated mathematical equations or computational models, and the input data can be easily measured or obtained, so the method can be readily used to determine the controlling factors in engineering. Such predictions can provide reasonable and sufficient guidance for designing safe construction schemes in deep mining engineering.

The major innovations and contributions of this thesis are as follows:

a) We developed two types of hybrid re-sampling algorithms. Addressing the problems and improper assumptions of the SMOTE algorithm, we proposed the hybrid re-sampling algorithm with automated, adaptive selection of the number of nearest neighbors and the Isomap-based hybrid re-sampling algorithm. Both algorithms can deal effectively with IDS classification.

b) We proposed a novel CSL algorithm for IDS. Addressing the fact that imbalanced class distributions and unequal misclassification costs almost always occur together, we combined the re-sampling and CSL techniques to solve the misclassification problem of IDS. The combined algorithm integrates the advantages of both and thus performs much better than existing methods.

c) We introduced IDS processing methods into the analysis of geoscience data. Because geoscience data are characterized by uncertainty, empiricism, bias, incompleteness, and imbalanced class distributions, we first employed dimensionality reduction to preprocess the data and then applied effective IDS classification methods to process huge amounts of geoscience data. Such an analytical scheme should be very powerful for the automatic and intelligent analysis of geoscience data.
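As referenced in part c) above, the following is a minimal sketch of classification by minimal expected misclassification cost; the cost values, data set, and base classifier are hypothetical, and the thesis' specific cost factors are not reproduced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced data set with roughly a 5% minority class (label 1).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# cost[i][j] = cost of predicting class j when the true class is i;
# missing a minority example is set 10x more costly than a false alarm.
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

# Predict the class with minimal expected cost rather than maximal probability:
# expected_cost[n, j] = sum_i P(class i | x_n) * cost[i, j].
expected_cost = proba @ cost
y_pred = expected_cost.argmin(axis=1)

print("minority recall:", (y_pred[y == 1] == 1).mean())
```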
Keywords/Search Tags: Imbalanced data sets, Re-sampling, Cost Sensitive Learning, Machine Learning, Classification