Research On Classification Based On Clustering For Imbalanced Dataset

Posted on: 2014-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: X S Chen
GTID: 2268330401981558
Subject: Computer application technology
Abstract/Summary:
Text classification is an important topic in data mining research and is widely used in real life. Many datasets share the following characteristic: some of the data make up only a small part of the whole, yet play a very important role in it. When a dataset contains several categories and the numbers of elements in those categories differ, it is called an imbalanced dataset. Traditional classification algorithms have difficulty correctly classifying the important minority class data in imbalanced datasets. How to classify the minority class data has therefore become an important branch of data mining research.

Two approaches are commonly used at present to solve the imbalanced dataset classification problem, because the algorithm and the data are both key factors in classifying imbalanced data. One is to design or improve the classification algorithm; the other is to reduce the degree of imbalance by undersampling the majority class data or oversampling the minority class data. Of course, the two approaches can be combined. This thesis covers both algorithm improvement and data processing for imbalanced dataset classification.

(1) The thesis proposes a clustering KNN algorithm (called JL-KNN): the majority class data are clustered with K-means, sorted by distance to the cluster center, and undersampled, which reduces the degree of imbalance. The KNN algorithm is then used to classify the data. Experiments show that JL-KNN improves classification performance on the minority class data.

(2) The thesis proposes an improved clustering KNN algorithm (GJL-KNN), which changes the category decision criterion for samples in the KNN classifier to the minimum average distance to samples of the same category, and applies this algorithm to imbalanced dataset classification. Experiments show that GJL-KNN performs better than KNN when classifying minority class data in imbalanced datasets.
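
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say which samples are kept after sorting or how the average distance is computed, so the sketch assumes that the samples nearest each cluster center are kept and that the average distance is taken per class over the k nearest neighbours. The function names, the keep_ratio parameter, and the fixed iteration count are all hypothetical.

```python
import math
import random


def nearest(p, centers):
    """Index of the center closest to point p (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's center where it is
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Final assignment so that labels match the final centers.
    return centers, [nearest(p, centers) for p in points]


def undersample_majority(majority, k, keep_ratio, seed=0):
    """Cluster the majority class and keep a fraction of each cluster.

    Assumption: the samples closest to their cluster center are kept.
    """
    centers, labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        members.sort(key=lambda p: math.dist(p, centers[c]))
        if members:
            kept.extend(members[:max(1, math.ceil(len(members) * keep_ratio))])
    return kept


def classify_min_avg_distance(x, train, k):
    """Label x with the class whose neighbours are closest on average.

    train is a list of (point, label) pairs. Assumption: the average is
    taken per class over the k nearest neighbours of x, not a majority vote.
    """
    neighbours = sorted(train, key=lambda item: math.dist(x, item[0]))[:k]
    by_class = {}
    for point, label in neighbours:
        by_class.setdefault(label, []).append(math.dist(x, point))
    return min(by_class, key=lambda lab: sum(by_class[lab]) / len(by_class[lab]))
```

With three neighbours of which two are distant majority samples and one is a nearby minority sample, a majority vote would pick the majority class, whereas the average-distance rule in this sketch picks the minority class. That is the kind of case the modified decision criterion appears intended to handle.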
Keywords/Search Tags: Classification, Imbalanced Datasets, K-means Algorithm, KNN Algorithm, Evaluation Criterion