The Design And Implementation Of Text Classification System Based On SVM-KNN

Posted on: 2012-09-21
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Wang
Full Text: PDF
GTID: 2268330425997118
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the Internet, people can obtain more and more information from the network in the form of text, images, sound and other media, and semi-structured or unstructured text makes up the majority of it. Using text classification techniques to classify and manage this information is therefore very important. Text classification addresses the problem of information clutter and has become a basis for information filtering, search engines and other areas, so research on text categorization is of real significance.
First, the thesis describes the theory of Chinese text categorization, including the vector space model, feature selection, classification methods, evaluation, weight calculation methods, and similarity calculation methods.
Then, through a detailed analysis of the TFIDF weighting algorithm, it notes that the algorithm considers only the term frequency of a feature item and its distribution over the whole training set. The thesis proposes adding to the original formula the distribution of the feature item across classes and across the texts within one class. It also analyzes and improves the information gain feature selection method. Because the performance of information gain drops significantly when the sample set is unevenly distributed, two variables, dispersion and concentration, are added to reflect how well a feature discriminates between classes, which further improves the performance of the method. The thesis then analyzes the KNN and SVM classification methods and, based on their respective advantages and disadvantages, proposes a combined SVM-KNN classification method. To address the weaknesses of these algorithms when samples are unevenly distributed, a penalty mechanism is added.
Based on this theoretical work, a Chinese text classification system is constructed in C++, comprising a pre-processing module, a weight calculation and feature selection module, a classification module, and a performance evaluation module.
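As a point of reference for the weighting discussion, the baseline TFIDF scheme that the thesis starts from can be sketched as follows. This is a minimal illustration in Python (the thesis system itself is written in C++), assuming length-normalised term frequency and an unsmoothed logarithmic inverse document frequency. The function name `tfidf` is hypothetical, and the improved formula with class-level distribution terms is not given in the abstract, so it is not reproduced here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document.

    Baseline scheme only: weight = (count / doc length) * log(N / df).
    It ignores how a term is spread across classes, which is the
    limitation the abstract describes.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy, already-segmented documents (illustrative data only).
docs = [["ball", "game", "ball"], ["stock", "game"]]
weights = tfidf(docs)
```

A term that occurs in every document, such as "game" here, receives weight zero regardless of which class those documents belong to. That blindness to class membership is what the abstract's added distribution terms are meant to correct.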
Finally, the Chinese text classification system is used as a test platform. Experiments on the Sogou Laboratory corpus show that the improved weight calculation and feature selection methods and the SVM-KNN classification method are effective and feasible.
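The abstract does not spell out how SVM and KNN are combined. One commonly described variant trusts the SVM when a sample lies far from the separating hyperplane and falls back to a nearest-neighbour vote over labelled reference vectors (for example, the support vectors) when it lies close to it. The sketch below assumes that variant for a two-class problem with labels +1 and -1 and an already-trained linear decision function. The names `svm_knn_predict`, `eps` and `k` and all numeric values are hypothetical, and the penalty mechanism mentioned in the abstract is not modelled.

```python
import math


def decision_value(w, b, x):
    """Signed score of a linear SVM: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def knn_vote(x, neighbors, k):
    """Majority vote among the k nearest labelled vectors (Euclidean distance)."""
    ranked = sorted(neighbors, key=lambda item: math.dist(x, item[0]))
    votes = sum(label for _, label in ranked[:k])
    return 1 if votes >= 0 else -1


def svm_knn_predict(x, w, b, reference_vectors, eps=0.5, k=3):
    """Use the SVM far from the hyperplane, KNN inside the band |score| <= eps."""
    score = decision_value(w, b, x)
    if abs(score) > eps:
        return 1 if score > 0 else -1
    return knn_vote(x, reference_vectors, k)


# Toy values standing in for a trained model (assumptions, not real results).
w, b = [1.0, 0.0], 0.0
reference_vectors = [
    ([1.0, 0.0], 1),
    ([-1.0, 0.0], -1),
    ([0.2, 1.0], -1),
    ([0.3, 1.2], -1),
]

far_positive = svm_knn_predict([3.0, 0.0], w, b, reference_vectors)
far_negative = svm_knn_predict([-2.0, 0.0], w, b, reference_vectors)
near_boundary = svm_knn_predict([0.1, 1.0], w, b, reference_vectors)
```

For the sample near the boundary, the SVM score is slightly positive, but the three nearest reference vectors vote for the negative class, so the hybrid rule returns -1 where the SVM alone would return +1.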
Keywords/Search Tags: Text Categorization, Weight Calculation, Feature Selection, K-Nearest Neighbors, Support Vector Machine