
Extraction Of Chi-square Features In Chinese Text Classification And Improvement Of TF-IDF Weight

Posted on: 2018-09-12
Degree: Master
Type: Thesis
Country: China
Candidate: J T Shi
Full Text: PDF
GTID: 2348330518966572
Subject: Control Engineering
Abstract/Summary:
In the 21st century, with the rapid development of the Internet and information technology, information is growing at an exponential rate. The amount of information available to people is very large and includes a great deal of text, but obtaining the required information in a timely and efficient way has become very difficult. Text classification can effectively address this problem and is widely used in information filtering, automatic summarization, digital libraries, text databases, and other fields. The study of text classification therefore has important theoretical significance and broad application prospects.

Feature selection picks the most representative features from the high-dimensional feature space, which can improve the efficiency and precision of text classification. Feature weighting assigns different weights to feature words according to their discriminative ability. Feature selection and feature weighting are two important steps in the text classification process, and they are the main research content of this thesis. The results are as follows.

First, the thesis analyzes the commonly used feature selection methods, including document frequency, mutual information, information gain, the chi-square statistic, and the correlation coefficient, and studies the chi-square statistic in depth. To address the bias of the traditional chi-square statistic, a word frequency factor is introduced. Because the chi-square statistic tends to select feature words that appear frequently in other categories but rarely in the specified category, an inter-class concentration coefficient and a correction coefficient are also introduced. The result is an improved chi-square statistic that combines the word frequency factor, the inter-class concentration coefficient, and the correction coefficient.

Second, the thesis analyzes common feature weighting methods and the defects of the traditional TF-IDF weight. When weighting a feature word, TF-IDF ignores the distribution of the feature term within a category and across categories. The thesis proposes a TF-IDF weight calculation method that combines a logarithmic statistic with intra-class information entropy.

Finally, two groups of comparative experiments on the same Chinese corpus are carried out to validate the effectiveness and feasibility of the improved chi-square statistic algorithm and the improved TF-IDF feature weighting algorithm. The results show that, compared with the traditional methods, the improved methods significantly raise the precision, recall, and F1 value of each class as well as the overall precision, recall, and F1 value.
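For orientation, the two traditional baselines the abstract starts from can be sketched as follows. This is a minimal illustration of the textbook chi-square statistic and the textbook TF-IDF weight, not the improved variants proposed in the thesis, whose formulas the abstract does not give. The function names `chi_square` and `tf_idf`, the token-list input format, and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, category):
    """Textbook chi-square score of `term` with respect to `category`.

    docs   -- list of documents, each a list of tokens (assumed pre-segmented)
    labels -- list of category names, one per document
    """
    n = len(docs)
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # in category, contains term
        elif has_term:
            b += 1  # other category, contains term
        elif in_cat:
            c += 1  # in category, lacks term
        else:
            d += 1  # other category, lacks term
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def tf_idf(docs):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (cnt / total) * math.log(n / df[t]) for t, cnt in tf.items()}
        )
    return weights


# Toy corpus (hypothetical data, for illustration only).
docs = [["stock", "market"], ["stock", "bank"], ["match", "goal"], ["goal", "team"]]
labels = ["finance", "finance", "sports", "sports"]

print(chi_square(docs, labels, "stock", "finance"))
print(chi_square(docs, labels, "stock", "sports"))
print(tf_idf(docs)[0])
```

On this toy corpus the word "stock" receives the same chi-square score for "sports", where it never occurs, as for "finance", where it occurs in every document. The squared term discards the sign of the association, which is one way to see the tendency the abstract describes: the traditional statistic can favor words that are frequent in other categories but rare in the specified one.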
Keywords/Search Tags: Text Classification, Vector Space Model, Chi Square Statistics, Feature Selection, Feature Weighting