Research Of Optimizing On KNN Algorithm Based On Clustering Concept In Text Classification

Posted on:2014-04-25

Degree:Master

Type:Thesis

Country:China

Candidate:Q F Lin

Full Text:PDF

GTID:2268330401485893

Subject:Computer system architecture

Abstract/Summary:

In the era of the knowledge economy, information has become one of its most important hallmarks, and increasing attention is paid to how information is managed and obtained. The structure of information has also changed, from structured and semi-structured to unstructured, so organizing and processing unstructured information has become increasingly important. Text classification, one of the key technologies, is widely used in information retrieval and in knowledge discovery and management. For very large text collections, however, the efficiency and accuracy of text classification severely restrict its use in real-time applications.

Existing classification methods are based on statistical theory and machine learning; well-known text classification methods include Bayesian classifiers, k-nearest neighbors (KNN), support vector machines, and neural networks. Among these, KNN is a simple, effective, non-parametric method that is widely used in text classification and achieves good results. Its computational cost, however, limits its use in real-time applications, so improving the efficiency of KNN has received wide attention. This thesis likewise focuses on improving the efficiency of text classification without decreasing classification accuracy.

We propose the clustering concept and, based on it, present two methods, the feature bit-string and the feature multi-class matrix, to improve the efficiency of text classification. The main contributions of this thesis are as follows:

1. A semantics-based clustering concept. A concept can appear in a text in several different forms. Text similarity calculations cannot recognize the relationship between these word forms, and their effect is often ignored. In this thesis, such forms are clustered into a single concept. Experimental results show that the clustering concept effectively expresses the meaning of these features and lets the similarity calculation reflect their contribution, which improves the accuracy of text classification and reduces the dimension of the text vector.

2. Reducing computation with the feature bit-string of a text. To overcome the heavy computation of the KNN algorithm, we propose a feature bit-string for each text, which quickly selects the training texts that may be similar to the test text. Shrinking the set of training texts to compare against in this way reduces the computation of KNN. Theoretical analysis and experiments show that the feature bit-string improves the efficiency of the KNN classification algorithm without reducing classification accuracy.

3. Reducing computation with the feature multi-class matrix. Analysis of KNN shows that, besides filtering for similar texts, the classes a test text cannot belong to can also be eliminated. The feature multi-class matrix narrows the set of possible categories for a text and thereby reduces the amount of computation. Experiments show that the feature multi-class matrix effectively improves the speed of classification.
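
The abstract does not give implementation details, so the following Python sketch only illustrates the general idea of concept clustering followed by bit-string filtering before KNN. Everything in it is an assumption made for illustration: the hand-written `CONCEPTS` map, the encoding of a text as an integer bitmask over the feature vocabulary, the `min_shared` threshold, and the use of cosine similarity with majority voting. It is not the thesis's actual method.

```python
import math
from collections import Counter

# Hypothetical concept map: different surface forms of one meaning
# are mapped to a single concept label (assumed, not from the thesis).
CONCEPTS = {"car": "vehicle", "cars": "vehicle", "automobile": "vehicle",
            "football": "soccer", "soccer": "soccer"}


def to_concepts(tokens):
    """Replace each token with its concept label; unknown tokens are kept."""
    return [CONCEPTS.get(t, t) for t in tokens]


def build_vocab(docs):
    """Assign each distinct feature a bit position."""
    vocab = {}
    for tokens in docs:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    return vocab


def bit_string(tokens, vocab):
    """Encode the set of features present in a text as an integer bitmask."""
    mask = 0
    for t in tokens:
        if t in vocab:
            mask |= 1 << vocab[t]
    return mask


def cosine(a, b):
    """Cosine similarity between two token lists using term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def classify(test_tokens, training, k=3, min_shared=1):
    """training is a list of (tokens, label) pairs.

    Training texts sharing fewer than min_shared features with the test
    text are skipped by a cheap bitwise AND; KNN runs on the rest.
    In practice the training bitmasks would be precomputed once.
    """
    test = to_concepts(test_tokens)
    train = [(to_concepts(toks), label) for toks, label in training]
    vocab = build_vocab(toks for toks, _ in train)
    test_mask = bit_string(test, vocab)
    candidates = [
        (toks, label) for toks, label in train
        if bin(bit_string(toks, vocab) & test_mask).count("1") >= min_shared
    ]
    if not candidates:
        return None
    nearest = sorted(candidates, key=lambda c: cosine(test, c[0]),
                     reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In this sketch the bitwise AND is much cheaper than a full similarity computation, so the expensive cosine step runs only on the texts that survive the filter. Whether accuracy is preserved depends on the threshold and the data, which is the question the thesis's experiments address.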

Keywords/Search Tags:

Feature Bit-string, Feature Multi-Class Matrix, clustering concept, KNN, text classification

Related items

1	Research And Implementation Of Feature Selection In Chinese Text Classification
2	Research On Classification And New Class Recognition Of Complaint Text In Business
3	Based On The Rapid Large-scale Text Hierarchical Classification Problem Of Centralized
4	Research On Chi-square Statistic Feature Selection Method And TF-IDF Feature Weighting Method For Chinese Text Classification
5	Researching And Application Of Multi-hierarchy Text Classification Technology
6	Research On Some Problems In Text Classification
7	The Research On Feature Selection Methods For Text Classification
8	Study On Text Classification Based On Multi-class Support Vector Machines
9	Analysis And Application For Web Text Classification Based On Support Vector Machine
10	A Text Automatic Classification System Of Class-Based Feature Selection Algorithm