Research On Classification Based On Clustering For Imbalanced Dataset

Posted on:2014-01-09

Degree:Master

Type:Thesis

Country:China

Candidate:X S Chen

Full Text:PDF

GTID:2268330401981558

Subject:Computer application technology

Abstract/Summary:

Text classification is an important topic in data mining research and is widely used in practice. In many datasets, some data make up only a small part of the whole yet play a very important role. When a dataset contains several categories and the numbers of elements in those categories differ, it is called an imbalanced dataset. Traditional classification algorithms have difficulty correctly classifying the important minority class data in imbalanced datasets. How to classify minority class data has therefore become an important branch of data mining research.

Two approaches are commonly used at present to solve the imbalanced dataset classification problem, because the algorithm and the data are both key factors in classifying imbalanced data. The first is to design or improve the classification algorithm. The second is to reduce the degree of imbalance by under-sampling the majority class data or over-sampling the minority class data. Of course, the two approaches can also be combined when preprocessing the data. This thesis covers both algorithm improvement and data processing for imbalanced dataset classification.

(1) The thesis proposes a clustering KNN algorithm (called JL-KNN): the majority class data are clustered using K-means, sorted by their distance to the cluster center, and under-sampled to reduce the degree of imbalance. The KNN algorithm is then used to classify the data. The experiment shows that the JL-KNN algorithm improves classification performance on minority class data.

(2) The thesis proposes an improved clustering KNN algorithm (GJL-KNN), which changes the category decision criterion for samples in the KNN classifier to the minimum average distance to samples of the same category, and applies this algorithm to imbalanced dataset classification. The experiment shows that the GJL-KNN algorithm performs better than the KNN algorithm when classifying minority class data in imbalanced datasets.
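The two proposals in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation, and it rests on several assumptions the abstract does not state: that under-sampling keeps the samples closest to each cluster center, that the improved decision rule is applied among the k nearest neighbours, and that distances are Euclidean. The function names (`undersample_majority`, `knn_predict`) and the `keep_ratio` parameter are hypothetical and chosen only for this illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda c: dist(p, centers[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns (centers, assignment)."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        assign = [nearest_center(p, centers) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, [nearest_center(p, centers) for p in points]


def undersample_majority(majority, k, keep_ratio, seed=0):
    """Cluster the majority class, sort each cluster by distance to its
    center, and keep the closest fraction of each cluster.
    ASSUMPTION: the abstract does not say which end of the sorted list
    is kept; keeping the closest samples is a choice made here."""
    centers, assign = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = sorted(
            (p for p, a in zip(majority, assign) if a == c),
            key=lambda p: dist(p, centers[c]),
        )
        kept.extend(members[: math.ceil(len(members) * keep_ratio)])
    return kept


def knn_predict(train, x, k, rule="vote"):
    """Classify x from train, a list of (point, label) pairs.
    rule="vote"    : ordinary KNN majority vote.
    rule="min_avg" : pick the class whose neighbours have the smallest
                     mean distance to x (one reading of the GJL-KNN rule)."""
    neighbours = sorted(train, key=lambda pl: dist(pl[0], x))[:k]
    by_label = {}
    for p, label in neighbours:
        by_label.setdefault(label, []).append(dist(p, x))
    if rule == "vote":
        return max(by_label, key=lambda lab: len(by_label[lab]))
    return min(by_label, key=lambda lab: sum(by_label[lab]) / len(by_label[lab]))
```

Under these assumptions, a JL-KNN-style pipeline would call `undersample_majority` on the majority class, merge the kept samples with the minority class, and classify with `knn_predict(..., rule="vote")`; a GJL-KNN-style pipeline would switch to `rule="min_avg"`. The average-distance rule lets a single very close minority neighbour outweigh several distant majority neighbours, which is the situation where a plain vote tends to favour the majority class.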

Keywords/Search Tags:

Classification, Imbalanced Datasets, K-means Algorithm, KNN Algorithm, Evaluation Criterion

Related items

1	Classification Algorithm And Evaluation On Imbalanced Datasets
2	The Application And Improvement Of SVM Algorithm In Imbalanced Datasets
3	Research And Application On CFS-HDRF Classification Algorithm For Imbalanced Data Set
4	A Symmetric Flipping Algorithm Research For Imbalanced Datasets Based On GMM-EM
5	Neural Network Based Classification Methods For Imbalanced Datasets
6	Research On Potential Home Broadband User Identification Problem With Large Scale Imbalanced Datasets
7	Classification On Imbalanced Datasets
8	Research On Imbalanced Dataset Classification
9	Research Of Imbalanced Datasets Preprocessing Combined With Clustering
10	Research And Application Of Imbalanced Data Classification Algorithm