
The Research On K-nearest Neighbor Chinese Text Categorization Algorithm

Posted on: 2011-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: T Lu
Full Text: PDF
GTID: 2178360308973011
Subject: Computer software and theory
Abstract/Summary:
The need to search for or extract information of a particular category from large data sources has made automatic text categorization a hot research topic. KNN is an important method for automatic text classification, and it handles large data sets with good stability. Building on a comprehensive overview of Chinese text categorization, this thesis focuses on the KNN algorithm. The main contents of this thesis are as follows:

(1) The thesis summarizes the research background and development status of text classification algorithms. It introduces the general process of Chinese text categorization, including its key technologies and the methods used to assess its quality.

(2) When KNN text categorization is applied to large-scale data, classification is slow. To address this problem, a kernel-KNN algorithm based on KNN categorization is proposed. It introduces the semantic relations between feature items and uses clustering to build center documents. This reduces the number of documents that KNN must search and so increases categorization speed. Simulation results show that the proposed algorithm improves classification speed.

(3) A categorization method is proposed to reduce the effect that an uneven distribution of the different resources in a training set has on text categorization. Based on k-KNN, it uses small values of K to test the training-set documents that lie near the boundaries between classes and assigns them to the correct class. This reduces misclassification at class boundaries. Experiments show that the method performs well.
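The idea in item (2), searching only the documents that belong to the nearest center documents instead of the whole training set, can be illustrated with a minimal sketch. This is not the thesis's kernel-KNN algorithm: the abstract does not give its details, so the sketch assumes plain term-frequency vectors, cosine similarity, and simple single-pass clustering within each class, and it omits the semantic relations between feature items. All names (`build_centers`, `classify`, `threshold`, `n_centers`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Component-wise mean of sparse vectors; serves as a center document."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def build_centers(train, threshold=0.5):
    """Single-pass clustering inside each class (an assumption, not the
    thesis's clustering). `train` is a list of (vector, label) pairs."""
    clusters = []
    for vec, label in train:
        best, best_sim = None, threshold
        for c in clusters:
            if c["label"] != label:
                continue
            sim = cosine(vec, c["center"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            # No same-class cluster is similar enough: start a new one.
            clusters.append({"label": label, "members": [vec], "center": dict(vec)})
        else:
            best["members"].append(vec)
            best["center"] = mean_vector(best["members"])
    return clusters


def classify(query, clusters, k=3, n_centers=2):
    """Run KNN only over members of the n_centers most similar clusters.
    Assumes `clusters` is non-empty."""
    nearest = sorted(clusters, key=lambda c: cosine(query, c["center"]),
                     reverse=True)[:n_centers]
    candidates = [(cosine(query, m), c["label"])
                  for c in nearest for m in c["members"]]
    top = sorted(candidates, key=lambda x: x[0], reverse=True)[:k]
    votes = Counter()
    for sim, label in top:
        votes[label] += sim  # similarity-weighted vote
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train = [
    ({"football": 2, "match": 1}, "sports"),
    ({"football": 1, "goal": 2}, "sports"),
    ({"stock": 2, "market": 1}, "finance"),
    ({"stock": 1, "bank": 2}, "finance"),
]
clusters = build_centers(train, threshold=0.3)
print(classify({"football": 1, "goal": 1}, clusters, k=3))
```

In this sketch the speed-up comes from comparing the query against the few center documents first, so the full similarity computation runs only over the members of the selected clusters. A smaller `n_centers` searches fewer documents at the risk of missing true neighbors.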
Keywords/Search Tags:KNN Text Categorization, Clustering, Semantic Similarity, Training Set