Research And Implementation Of KNN Text Classification Based On CURE Clustering

Posted on: 2015-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: L Huang
Full Text: PDF
GTID: 2208330431976819
Subject: Communication and Information System
Abstract/Summary:
With the arrival of the information age, the volume of text information has grown rapidly. Text classification techniques help locate the information and knowledge that users need, and in recent years they have been applied to information retrieval and information discovery. Within the vector space model, K-Nearest Neighbor (KNN) is a relatively effective classification algorithm. This thesis studies the background of text classification, the classification process, and the related algorithms. Building on these algorithms, it proposes a KNN text classification method improved with CURE clustering, implements a KNN classification system, and verifies the effectiveness of the improvements.

An analysis of feature evaluation functions such as information gain, mutual information, and cross entropy shows that the original functions have defects of varying degree. This thesis improves the feature selection evaluation formulas and compares classification accuracy and F1 value before and after the improvement to assess their performance. An in-depth study of feature weighting algorithms finds that TFIDF weights a keyword only by its behavior across the whole training set and does not consider how the keyword is distributed within a class and between classes; a corresponding improvement based on information gain is therefore proposed.

In text classification, the KNN algorithm must calculate the similarity between the vector of the sample to be classified and the vectors of all samples in the training set. When the training set is large, the amount of computation becomes too large and classification efficiency is low. This thesis proposes a KNN text classification method improved by combining it with a clustering algorithm, and CURE is chosen as the underlying clustering algorithm.
Considering that the number of outliers can increase under the CURE algorithm, the shrinkage factor (delta) is adjusted during processing and some isolated points are removed to preserve the accuracy of the clusters, which reduces unnecessary interference and improves both the efficiency and the results of text classification. A KNN classifier is implemented to verify the classification quality and efficiency of the proposed algorithm. The texts are organized, different corpora are selected for comparison, and the classification results are output. Precision, recall, and F1 are used as the three evaluation indices to assess the improvement in classification performance.
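
The idea of classifying against a reduced set of cluster representatives instead of every training vector can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes the clusters are already known, treats every cluster member as a representative point, and uses cosine similarity with majority voting. The function names and the `shrink` parameter (standing in for the shrinkage factor) are hypothetical.

```python
import math
from collections import Counter


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def shrink_representatives(points, shrink):
    """CURE-style shrink: move each point a fraction `shrink` toward the centroid.

    Assumption: every point passed in is used as a representative; a full CURE
    implementation would first pick a few well-scattered points per cluster.
    """
    c = centroid(points)
    return [[x + shrink * (ci - x) for x, ci in zip(p, c)] for p in points]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(query, labelled, k):
    """Majority vote among the k (vector, label) pairs most similar to `query`."""
    ranked = sorted(labelled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy term-weight vectors grouped by class (made-up illustrative data).
clusters = {
    "sports": [[1.0, 0.0], [0.8, 0.2]],
    "finance": [[0.0, 1.0], [0.2, 0.8]],
}
representatives = [
    (point, label)
    for label, points in clusters.items()
    for point in shrink_representatives(points, 0.5)
]
predicted = knn_classify([0.9, 0.1], representatives, k=3)
```

Because the query is compared only with the representatives, the number of similarity computations depends on the number of representatives rather than on the size of the full training set, which is the efficiency argument made above.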
Keywords/Search Tags: Text Classification, KNN, Feature Selection, TFIDF, CURE Clustering