
Research On Chi-square Statistic Feature Selection Method And TF-IDF Feature Weighting Method For Chinese Text Classification

Posted on: 2017-02-16  Degree: Master  Type: Thesis
Country: China  Candidate: H Y Yao  Full Text: PDF
GTID: 2308330482495082  Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology and the continuous growth of the Internet, the amount of information that can be obtained online is exploding. Since most of this information exists in the form of text, how to extract useful information from massive amounts of text quickly, accurately and comprehensively has become a major research issue. As one of the key technologies of information retrieval, text classification can organize and process large collections of texts and has been applied in many fields. Feature selection and feature weighting occupy a very important position in Chinese text classification. Feature selection reduces the dimensionality of the text feature space, and thus improves classification efficiency, by keeping words with strong discriminating ability as feature terms and filtering out words with little influence. Feature weighting assigns different weights to different feature terms according to the contribution of each feature word, so as to distinguish their classification ability. The results of feature selection and feature weighting have a direct impact on classification performance, so finding effective feature selection and feature weighting algorithms is a crucial issue in text classification.

This thesis centers on improving the feature selection method and the feature weighting algorithm for text classification. First, it introduces the theoretical background and basic steps of text classification, including preprocessing techniques, text representation models and classification methods. Secondly, it focuses on feature selection for text classification, surveying several commonly used feature selection methods and analyzing the advantages and disadvantages of each. Because the traditional chi-square statistic (CHI) suffers from a low-frequency word defect and ignores the fact that different feature terms are distributed differently within a specific category, this thesis improves it by introducing two factors: the feature term's frequency and the class-specific information entropy. Because the traditional chi-square statistic also tends to select feature terms that are negatively correlated with a category, a correction factor is introduced to compensate for this deficiency. The result is an improved chi-square statistic method, named ICHI, based on the feature term's frequency and the class-specific information entropy. In parallel, several commonly used feature weight calculation methods are studied and discussed. Since the traditional TF-IDF weighting ignores the distribution of words within a specific category, the class-specific information entropy is used to compensate for this defect, and the chi-square statistic is combined with TF-IDF to remedy its neglect of the distribution of words between categories; see the sketch below.
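As a concrete illustration of the ideas described above, the following Python sketch implements the traditional chi-square statistic and one possible way of combining it with a term-frequency factor, a class-specific entropy factor and a correction factor for negatively correlated terms (the ICHI idea), as well as of folding the entropy and CHI factors into the TF-IDF weight. The abstract does not give the exact formulas, so the entropy definition, the multiplicative combinations and all function names here are assumptions made for illustration, not the thesis's actual equations.

```python
import math


def chi_square(A, B, C, D):
    """Traditional chi-square statistic CHI(t, c) for term t and class c.

    A: documents in class c that contain t
    B: documents outside c that contain t
    C: documents in class c that do not contain t
    D: documents outside c that do not contain t
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denom


def class_entropy(tf_per_doc_in_class):
    """Hypothetical class-specific entropy factor: entropy of the term's
    frequency distribution over the documents of one class. A term spread
    evenly across the class's documents gets a high value."""
    total = sum(tf_per_doc_in_class)
    if total == 0:
        return 0.0
    probs = [f / total for f in tf_per_doc_in_class if f > 0]
    return -sum(p * math.log(p, 2) for p in probs)


def ichi(A, B, C, D, tf_in_class, tf_per_doc_in_class):
    """Sketch of an ICHI-style score: the CHI value scaled by the term's
    frequency in the class and by the class-specific entropy, with a sign
    correction that pushes negatively correlated terms (A*D < B*C) to the
    bottom of the ranking. The multiplicative form is assumed."""
    correction = 1.0 if A * D - B * C >= 0 else -1.0
    return correction * chi_square(A, B, C, D) * tf_in_class * class_entropy(tf_per_doc_in_class)


def improved_tfidf(tf, df, n_docs, chi_value, tf_per_doc_in_class):
    """Sketch of the improved TF-IDF weight: classic tf * idf multiplied by
    the class-specific entropy factor (within-class distribution) and by the
    CHI value (between-class distribution); again an assumed combination."""
    idf = math.log(n_docs / (df + 1))
    return tf * idf * class_entropy(tf_per_doc_in_class) * chi_value
```

The sign-based correction factor is one common way to keep CHI from favouring terms that mostly indicate the absence of a class, while the two extra factors reward terms that are both frequent and evenly distributed inside the target class.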
The outcome is an improved TF-IDF weight calculation method based on the class-specific information entropy and the chi-square statistic. Finally, in order to verify the feasibility and effectiveness of the improved chi-square statistic method and the improved TF-IDF weight calculation method, two comparative experiments were carried out on a Chinese text classification platform using a Chinese corpus from Fudan University, and the results were evaluated by precision, recall, the F1 value and related measures, computed per class as sketched below. The experimental results show that the proposed chi-square statistic method achieves a better dimensionality reduction effect than traditional feature selection methods, and that the improved TF-IDF algorithm computes feature weights more effectively than traditional methods, so that the accuracy and efficiency of Chinese text classification are further improved.
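For completeness, a minimal sketch of the per-class evaluation measures mentioned above, assuming simple true/false positive and false negative counts per class (the abstract does not give the actual evaluation code of the experiments):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from prediction counts:
    tp = documents of the class correctly assigned to it,
    fp = documents of other classes wrongly assigned to it,
    fn = documents of the class assigned elsewhere."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Example: 180 correct out of 200 predicted, 20 of the class's documents missed.
print(precision_recall_f1(tp=180, fp=20, fn=20))  # (0.9, 0.9, 0.9)
```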
Keywords/Search Tags: text classification, feature weighting, feature selection, feature term frequency, class-specific information entropy