
Study Of Chinese Text Classification

Posted on: 2011-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: W L Lv
Full Text: PDF
GTID: 2178330305460423
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology and the spread of the Internet, the volume of information resources has increased dramatically. Managing and using these resources effectively, and giving users fast, comprehensive, and accurate retrieval, has become a problem that must be solved. As an important information processing technology, text classification supports the fast, comprehensive, and accurate organization and use of information resources and gives users efficient retrieval. It has therefore become an increasingly important research area.
This paper concentrates on the key technologies of Chinese text classification: Chinese word segmentation, text representation, feature selection and weight calculation, text classification algorithms, and classifier performance evaluation. It then studies in depth the two core issues of the classification process: feature selection and weighting, and the classification algorithm itself. Feature selection and weight calculation mean selecting the terms that best represent the text content and weighting them, which removes the negative effect of uninformative terms and improves the speed and accuracy of classification. The quality of the classification algorithm directly affects the classification results.
This paper makes two contributions. In weight calculation, it analyzes two defects of the classic feature weighting method TFIDF: sensitivity to dataset skew, and failure to account for differences in term distribution between classes and within a class. To address the second defect, the MI (mutual information) feature selection algorithm is introduced into the weight calculation to capture distribution differences between classes, and a correction factor is added to the formula to compensate for differences within a class. In classification algorithm design, the SE-Bagging KNN algorithm is designed on the basis of previous research results.
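The MI-adjusted weighting idea described above can be illustrated with a minimal Python sketch. It is not the formula from the thesis: the abstract gives neither the exact way MI enters the weight nor the within-class correction factor. The sketch therefore assumes a simple product of TF, IDF, and the largest non-negative pointwise MI between a term and any class, and it omits the correction factor entirely. The function name `mi_weighted_tfidf` is hypothetical, and documents are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def mi_weighted_tfidf(docs, labels):
    """Return one {term: weight} dict per document.

    Assumed form (not taken from the thesis):
        weight(t, d) = tf(t, d) * idf(t) * max_c max(0, PMI(t, c))
    docs   : list of token lists (already word-segmented)
    labels : list of class labels, one per document
    """
    n_docs = len(docs)
    df = Counter()            # number of documents containing each term
    df_class = Counter()      # number of documents of class c containing term t
    class_size = Counter(labels)

    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_class[(term, label)] += 1

    def mi_factor(term):
        # Pointwise MI of term presence and class membership:
        #   log( P(t, c) / (P(t) * P(c)) ) = log( joint * N / (df * |c|) )
        # Negative values are clipped to 0 so the weight stays non-negative.
        best = 0.0
        for c, size in class_size.items():
            joint = df_class[(term, c)]
            if joint:
                best = max(best, math.log(joint * n_docs / (df[term] * size)))
        return best

    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({
            term: (count / len(tokens))              # term frequency
                  * math.log(n_docs / df[term] + 1.0)  # smoothed IDF
                  * mi_factor(term)                  # class-discrimination factor
            for term, count in tf.items()
        })
    return weighted
```

Under this assumed form, a term that appears equally often in every class receives a weight of zero no matter how rare it is overall, which is the between-class effect that plain TFIDF cannot express.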
The SE-Bagging KNN algorithm introduces selective ensemble learning into Chinese text classification. Building on the attribute sensitivity of the KNN algorithm, it uses sampling with replacement to generate new training sets, trains a weak KNN classifier on each of them, and then applies variable similarity clustering to select the classifiers that are both diverse and more accurate; the selected classifiers are used to classify new texts. Experiments were carried out on the .NET platform using the WordSeg component and the Weka data mining package. The results show that, compared with traditional classification algorithms, the SE-Bagging KNN text classification algorithm clearly improves classification quality.
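The general shape of a bagged KNN classifier with a selection step can be sketched in Python as follows. This is an illustration under stated assumptions, not the SE-Bagging KNN algorithm itself: the abstract does not describe variable similarity clustering, so the sketch swaps in a much simpler selection rule that keeps the ensemble members with the highest accuracy on a held-out validation set. All function names (`cosine`, `knn_predict`, `bagged_knn`, `ensemble_predict`) and parameters are hypothetical, and documents are assumed to be sparse `{term: weight}` vectors.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, x, k=3):
    """Majority label among the k training vectors most similar to x.

    train: list of (vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: cosine(item[0], x), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def bagged_knn(train, validation, n_members=10, keep=5, k=3, seed=0):
    """Build bootstrap training sets and keep the best-scoring ones.

    Selection here is by validation accuracy only; this is a stand-in
    for the clustering-based selection named in the abstract.
    """
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        # Sampling with replacement: same size as the original training set.
        sample = [rng.choice(train) for _ in train]
        correct = sum(knn_predict(sample, x, k) == y for x, y in validation)
        members.append((correct / len(validation), sample))
    members.sort(key=lambda m: m[0], reverse=True)
    return [sample for _, sample in members[:keep]]


def ensemble_predict(members, x, k=3):
    """Majority vote over the selected KNN members."""
    votes = Counter(knn_predict(sample, x, k) for sample in members)
    return votes.most_common(1)[0][0]
```

Because KNN has no training phase, each ensemble member here is just its bootstrap sample. A selection rule that also rewards disagreement between members, as the clustering step in the abstract appears to do, would replace the sort by accuracy in `bagged_knn`.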
Keywords/Search Tags: text classification, feature selection, weight calculation, selective ensemble, variable similarity clustering