Research on Optimizing the KNN Algorithm Based on the Clustering Concept in Text Classification

Posted on: 2014-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: Q F Lin
Full Text: PDF
GTID: 2268330401485893
Subject: Computer system architecture
Abstract/Summary:
In the era of the knowledge economy, information has become one of the most important resources, and increasing attention is being paid to how it is managed and obtained. The structure of information has also changed, moving from structured and semi-structured forms to unstructured ones, so organizing and processing unstructured information has become increasingly important. Text classification, one of the key technologies for this task, is widely used in information retrieval and in knowledge discovery and management. For very large volumes of text, however, the efficiency and accuracy of text classification severely restrict its use in real-time applications.

Existing classification methods are based on statistical theory and machine learning; well-known text classification methods include Bayesian classifiers, k-nearest neighbors (KNN), support vector machines, and neural networks. Among these, KNN is a simple, effective, non-parametric method that is widely used in text classification and achieves good results. Its computational cost limits its use in real-time applications, because classifying a text requires computing its similarity to every training text. Improving the efficiency of KNN has therefore attracted wide attention, and this thesis likewise focuses on improving the efficiency of text classification without decreasing classification accuracy.

In this thesis, we propose the clustering concept and, building on it, present two methods, the feature bit-string and the feature multi-class matrix, to improve the efficiency of text classification.

The main innovations of this thesis are as follows:

1. A semantics-based clustering concept. A single concept often appears in a text in several different word forms. Text similarity calculations cannot recognize the relationship between these words, so their effect is often ignored. In this thesis, such words are clustered into one concept. Experimental results show that the clustering concept effectively expresses the meaning of these features and that the similarity calculation then reflects their contribution, which improves the accuracy of text classification and reduces the dimension of the text vector.

2. Reducing computation with the feature bit-string of a text. To overcome the heavy computation of the KNN algorithm, we propose the feature bit-string of a text, which quickly screens the training texts for those that may be similar to the test text. Shrinking the set of training texts to be compared reduces the computation of KNN. Theoretical analysis and experiments show that the feature bit-string improves the efficiency of the KNN classification algorithm without reducing classification accuracy.

3. Reducing computation with the feature multi-class matrix. Analysis of KNN shows that, besides screening for similar texts, the classes to which the test text cannot belong can also be eliminated. The feature multi-class matrix reduces computation by narrowing the set of possible categories for a text. Experiments show that the feature multi-class matrix effectively improves the speed of classification.
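The abstract does not give the exact construction of the clustering concept or the feature bit-string, so the following Python sketch is only one plausible reading, under stated assumptions. It assumes that word forms are mapped to concept ids through a hand-written table (`CONCEPT_OF`, hypothetical), that a feature bit-string is a bit mask with one bit per concept present in a text, and that training texts sharing fewer than `min_shared` concepts with the test text are skipped before cosine similarity is computed. All function names, the threshold parameter, and the toy data are illustrative and not taken from the thesis.

```python
import math
from collections import Counter

# Hypothetical word-form -> concept-id table (illustrative only).
CONCEPT_OF = {
    "car": 0, "cars": 0, "automobile": 0,
    "engine": 1, "engines": 1,
    "match": 2, "game": 2,
    "team": 3, "teams": 3,
}

def to_concepts(tokens):
    """Count concept occurrences; words outside the table are dropped."""
    return Counter(CONCEPT_OF[t] for t in tokens if t in CONCEPT_OF)

def bit_string(vec):
    """Integer bit mask with bit c set when concept c occurs in the text."""
    mask = 0
    for concept in vec:
        mask |= 1 << concept
    return mask

def cosine(a, b):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(a[c] * b[c] for c in a if c in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_train(docs):
    """Precompute (concept vector, bit mask, label) for each training text."""
    train = []
    for tokens, label in docs:
        vec = to_concepts(tokens)
        train.append((vec, bit_string(vec), label))
    return train

def knn_classify(train, tokens, k=3, min_shared=1):
    """Prefilter by bit mask, then vote among the k most similar candidates."""
    query = to_concepts(tokens)
    query_mask = bit_string(query)
    # Cheap screen: one AND plus a bit count per training text (assumed rule).
    candidates = [
        (vec, label)
        for vec, mask, label in train
        if bin(mask & query_mask).count("1") >= min_shared
    ]
    # Full similarity is computed only for the surviving candidates.
    nearest = sorted(candidates, key=lambda vl: cosine(query, vl[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0] if votes else None

if __name__ == "__main__":
    train = build_train([
        (["car", "engine"], "auto"),
        (["cars", "engines", "automobile"], "auto"),
        (["team", "match"], "sport"),
        (["teams", "game"], "sport"),
    ])
    print(knn_classify(train, ["automobile", "engine"], k=1))
```

With `min_shared=1`, the screen drops only training texts whose cosine similarity to the test text would be zero; larger thresholds trade some recall of neighbors for less computation. Mapping "car", "cars", and "automobile" to one concept id also shortens the vector, which is the dimension reduction the first innovation describes.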
Keywords/Search Tags: Feature Bit-string, Feature Multi-Class Matrix, clustering concept, KNN, text classification