
On Research For Chinese Automatic Text Categorization Technology Based On VSM Model And Feature Selection

Posted on: 2012-10-13 | Degree: Master | Type: Thesis
Country: China | Candidate: K H Zhu | Full Text: PDF
GTID: 2218330338468313 | Subject: Management Science and Engineering
Abstract/Summary:
Text classification is the task of having a computer assign texts to predefined categories, based on their content, using an automatic classification algorithm. It is an important problem in information storage and retrieval, and it matters for information retrieval, information filtering, personalized information services, and related applications. However, with the rapid growth of digital documents, processing large-scale text collections has become a challenge. One basic difficulty in automatic text classification is that the dimension of the feature space is excessively high. On the one hand, the very large number of feature items makes the classification algorithm too expensive to run; on the other hand, it makes it hard to capture a document's category information accurately, which degrades classification quality. We therefore need a method that reduces the dimension of the feature space while sacrificing as little classification quality as possible. This study examines how to use the vector space model (VSM) and feature selection techniques to reduce the dimension of the feature space and to limit the effect of that reduction on classification. To this end, the thesis proposes improved versions of the tf-idf and mutual information feature selection algorithms, and also proposes a new measure, CPD, which is applied to text feature selection. The main contents and innovations are as follows:

1. To address the shortcomings of tf-idf in the vector space model (VSM), the thesis proposes an improved method. Traditional tf-idf does not consider the order and position of a feature item in the text. The improved tf-idf adds the number of texts in a category that contain the word and the number of times the word occurs in the text. This helps in selecting feature items and distinguishes their importance more clearly.

2. To reduce the dimension of the feature vector space, the thesis proposes an improved feature selection method based on mutual information. The improved mutual information effectively addresses the problem that the marginal distribution of words inflates the scores of rare words, and to some extent it solves the "over-fitting" problem.

3. Among the methods for reducing the dimension of the feature space, the thesis also introduces a discrimination measure. On this basis it proposes a feature selection method, the proportion of discrimination (CPD) method, and compares it with several other feature selection algorithms.

4. The thesis designs a Chinese text classification system with five parts: pre-classification, feature selection, text representation, text classification, and evaluation. Pre-classification covers word segmentation of the training set, stop-word removal, and digit filtering. The feature selection part compares methods based on document frequency, mutual information, the chi-square statistic, improved mutual information, and the proportion of discrimination, and uses tf-idf and improved tf-idf to calculate the weight of each feature item in the text. The classification part uses an SVM classifier.

5. All experiments use the Chinese text classification corpus TanCorpV1.0 as experimental data, use the Chinese word segmentation system ICTCLAS to segment the text, and take the micro-average and macro-average values as the classification performance measures.
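For orientation, the baseline quantities named in the abstract can be sketched in a few lines of Python. This is a minimal sketch of the textbook forms only: standard tf-idf weighting, document-level mutual information between a term and a category, and micro- and macro-averaged F1. It does not reproduce the thesis's improved tf-idf, improved mutual information, or CPD, since the abstract does not give their formulas. The function names and the toy data are assumptions made for illustration, and F1 is assumed as the averaged measure because the abstract does not say which one is averaged.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook tf-idf. docs is a list of token lists; returns one
    {term: weight} dict per document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


def mutual_information(docs, labels, term, category):
    """Textbook MI(t, c) = log(P(t, c) / (P(t) * P(c))), estimated from
    document counts. Returns -inf when t never occurs in c."""
    n = len(docs)
    both = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    n_term = sum(1 for d in docs if term in d)
    n_cat = sum(1 for y in labels if y == category)
    if both == 0:
        return float("-inf")
    return math.log(both * n / (n_term * n_cat))


def micro_macro_f1(y_true, y_pred):
    """Micro- and macro-averaged F1 for single-label classification."""
    cats = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-category F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in cats) / len(cats)
    # Micro: pool the counts over all categories, then compute F1 once.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy corpus (invented for illustration, already segmented into tokens).
docs = [["sport", "ball"], ["sport", "team"],
        ["stock", "market"], ["stock", "ball"]]
labels = ["sports", "sports", "finance", "finance"]

weights = tf_idf(docs)
mi_sport = mutual_information(docs, labels, "sport", "sports")
mi_team = mutual_information(docs, labels, "team", "sports")
micro, macro = micro_macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

On this toy corpus, "team" occurs in a single sports document yet receives the same mutual information score as "sport", which occurs in every sports document. That is the rare-word bias the improved mutual information in item 2 is meant to counter.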
Keywords/Search Tags: text classification, KNN, tf-idf, feature selection, vector space model, mutual information