Research Of Chinese Text Classification Based On KNN

Posted on: 2011-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 2178360305461147
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, people can obtain more and more information from the network in the form of text, images, sound, and other media. Most of it is semi-structured or unstructured text, so managing it with text classification techniques is very important. Text classification helps resolve information clutter and has become a foundation of fields such as information retrieval and search engines. Research on text categorization is therefore of significant value.

This thesis introduces the relevant theory of Chinese text classification: the vector space model, Chinese word segmentation, feature selection, classification methods, evaluation indicators, weight calculation methods, and similarity calculation methods.

Through a detailed analysis of TFIDF, the thesis identifies its shortcoming: it considers only word frequency and the distribution of a feature across the training text set. To address this, an improved scheme is proposed that adds to the original formula the distribution of the feature in each class and across all texts within a class.

To address shortcomings in the calculation of text similarity, a second improved scheme is proposed on the basis of an in-depth analysis of the KNN classification method. The new scheme introduces the idea of the central vector classification method and takes into account that the number of features shared by the text to be classified and a training text is important to classification.

Building on this theoretical research, a Chinese text categorization system is constructed with four functional modules: a preprocessing module, a feature selection module, a classification module, and an evaluation and display module. The system uses SQL Server 2000 as its back-end database and is implemented in C#.

Finally, using the implemented system as the test platform, experiments verify the validity and feasibility of the improved TFIDF weight calculation method and the improved KNN classification method.
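For orientation, the baseline that the abstract sets out to improve can be sketched as follows. This is a minimal illustration of standard TFIDF weighting with cosine-similarity KNN, not the thesis's improved formulas, which the abstract does not spell out. It assumes the texts have already been split into words by a Chinese word segmenter; the smoothed IDF variant, the similarity-weighted vote, and all function names and toy data are assumptions made for the example (the thesis system itself is written in C#).

```python
import math
from collections import Counter


def idf_table(train_docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n = len(train_docs)
    # Document frequency: in how many training texts each term appears.
    df = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """Sparse TFIDF vector (term -> weight); terms unseen in training are dropped."""
    tf = Counter(term for term in doc if term in idf)
    total = sum(tf.values())
    if total == 0:
        return {}
    return {term: (count / total) * idf[term] for term, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(doc, train_docs, train_labels, k=3):
    """Label a segmented text by a similarity-weighted vote of its k nearest neighbors."""
    idf = idf_table(train_docs)
    query = tfidf(doc, idf)
    scored = [
        (cosine(query, tfidf(train_doc, idf)), label)
        for train_doc, label in zip(train_docs, train_labels)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each text is a list of already-segmented words.
train_docs = [
    ["股票", "市场", "上涨"],
    ["股票", "基金", "投资"],
    ["足球", "比赛", "进球"],
    ["篮球", "比赛", "球队"],
]
train_labels = ["finance", "finance", "sports", "sports"]

predicted = knn_classify(["股票", "投资", "市场"], train_docs, train_labels, k=3)
```

In this baseline a term's weight depends only on its frequency in the text and on how many training texts contain it, and similarity ignores how many features two texts share. These are the two points the proposed improvements target.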
Keywords/Search Tags:Text Categorization, Weight Calculation, K-Nearest Neighbors, Similarity