
Research on Classification Algorithms for Imbalanced Data and Its Application to the ID Matching Problem

Posted on: 2017-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: L R Gu
Full Text: PDF
GTID: 2308330482989807
Subject: Computer technology
Abstract/Summary:
The study of classification algorithms is one of the most significant research directions in data mining. Classification algorithms are usually trained on datasets whose classes have the same or similar sizes, the misclassification cost of each class is assumed to be equal, and overall accuracy is taken as the evaluation criterion. Typical classification algorithms such as decision trees (ID3, C4.5), the support vector machine (SVM), genetic algorithms (GA) and k-nearest neighbours (KNN) have been employed in a wide range of applications. In many real-world classification problems, however, the classes differ greatly in size, sometimes by orders of magnitude; such data are called imbalanced data. In this case a traditional classifier tends to assign almost all samples to the larger classes and ignore the smaller ones. Although the overall accuracy of such a classifier is high, its accuracy on the minority classes is often low, yet the minority classes are usually exactly the classes we care about. The classification of imbalanced datasets has therefore become one of the hot topics in machine learning.

In industry, ID matching is needed to obtain accurate user behaviour attributes. Among the massive volumes of ID-pair data, the event that two randomly chosen IDs belong to the same user is very rare, so ID matching is a typical imbalanced classification problem, and traditional machine learning methods often perform poorly when matching these massive ID pairs.

Current research on imbalanced classification proceeds in two main directions: the algorithm level and the data level. At the algorithm level, different misclassification costs are assigned to the different classes in the dataset and the corresponding weights in the classification model are adjusted, which shifts the classifier's decision boundary towards the classes we care about most. At the data level, the imbalanced data themselves are processed: under-sampling reduces the number of majority-class samples, while over-sampling increases the number of minority-class samples, so that the class distribution becomes closer to that of a traditional balanced dataset and standard classification algorithms can then be trained on the processed data.

Focusing on imbalanced classification, this thesis proposes an improved over-sampling algorithm based on SMOTE and an under-sampling algorithm based on density, and also integrates the two. Traditional SMOTE generates a new sample for a given minority point by choosing one of its nearest minority neighbours and taking a random point on the line segment between them. The improved SMOTE algorithm instead finds the K nearest neighbours of the point and takes their geometric centre as the new synthetic point. Compared with the traditional method, each new point is thus derived from more than two original samples, so more information from the original data is exploited and the synthetic points are more realistic.
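To make the improved over-sampling step concrete, the following is a minimal sketch in Python of a centroid-based SMOTE variant along the lines described above. The function name `centroid_smote`, its parameters, and the use of scikit-learn's `NearestNeighbors` are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def centroid_smote(X_min, k=5, n_new=100, seed=None):
    """Sketch of the improved SMOTE: each synthetic minority sample is the
    geometric centre of a seed point and its k nearest minority neighbours.

    X_min : (n_minority, n_features) array of minority-class samples only.
    k     : number of nearest minority neighbours per synthetic point.
    n_new : number of synthetic samples to generate.
    """
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because the query point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))           # random minority seed point
        group = X_min[idx[i]]                  # the seed and its k nearest neighbours
        synthetic.append(group.mean(axis=0))   # geometric centre of the group
    return np.asarray(synthetic)
```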
Secondly, the thesis puts forward an under-sampling algorithm based on density. The main idea is first to determine the ratio of positive to negative samples and then to compute the pairwise distances between all majority-class (negative) samples; the density of each point is obtained from these distances, and an appropriate number of the points with the highest densities is selected as the reduced majority-class sample set (a sketch of this selection step is given at the end of this abstract).

With the improved SMOTE over-sampling algorithm and the density-based under-sampling algorithm, better results were obtained on six imbalanced UCI datasets. The ID matching problem likewise produces large numbers of imbalanced datasets. By applying the improved SMOTE over-sampling algorithm and the density-based under-sampling algorithm, the imbalance of these datasets is reduced significantly, and traditional classification algorithms trained on the re-sampled data then achieve better results.
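As an illustration of the density-based under-sampling step described above, here is a minimal sketch in the same vein. It estimates each majority sample's local density as the number of other majority samples within a cutoff distance, which is one possible reading of the density criterion in the abstract; the function name `density_undersample`, the cutoff heuristic `d_c`, and the percentile choice are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_undersample(X_maj, n_keep, d_c=None):
    """Sketch of density-based under-sampling: keep the n_keep majority-class
    samples with the highest local density, estimated from pairwise distances.

    X_maj  : (n_majority, n_features) array of majority-class samples only.
    n_keep : number of majority samples to retain, e.g. chosen from the class ratio.
    d_c    : cutoff distance; by default a small percentile of the pairwise distances.
    """
    dist = squareform(pdist(X_maj))             # pairwise distances between majority samples
    if d_c is None:
        d_c = np.percentile(dist[dist > 0], 2)  # assumed heuristic for the cutoff radius
    rho = (dist < d_c).sum(axis=1) - 1          # local density: neighbours within d_c, minus self
    keep = np.argsort(rho)[::-1][:n_keep]       # indices of the densest points
    return X_maj[keep]
```

For instance, `n_keep` could be set close to the minority-class size so that the re-sampled dataset is roughly balanced before a standard classifier is trained on it.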
Keywords/Search Tags: imbalanced dataset, re-sampling, machine learning, classification