Research On Term Weighting Approach Based On Information Gain And Entropy

Posted on: 2013-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: H R Li
Full Text: PDF
GTID: 2248330362474370
Subject: Computer software and theory
Abstract/Summary:
Faced with the ever-expanding volume of information on the Internet, people often feel confused and lost among the vast amount of information resources. How to find the required information accurately and efficiently in these resources has become an important problem for researchers. As an effective method of organizing and managing data, text classification can greatly reduce the disorder of information on the Internet, shrink the search space, speed up retrieval and improve query accuracy. Term weight calculation is the core step of text classification, and its accuracy has a significant impact on the classification result.

This paper first analyzes the advantages and disadvantages of TFIDF, a traditional term weighting algorithm. Then, to overcome the disadvantages of that algorithm, it proposes a new term weighting method based on information gain and information entropy, which makes the term weights more precise and improves the accuracy of text classification.

The main research work of this paper is as follows:

① This paper analyzes the common feature selection algorithms and compares the experimental results of three of them: DF, IG and CHI. The comparison shows that IG gives the best result, so it is used as the feature selection algorithm in this paper.

② This paper introduces the existing term weighting methods, including Boolean weights, DF, entropy values and TFIDF. It then analyzes the advantages and disadvantages of the TFIDF algorithm and summarizes the existing improvements that address its disadvantages.

③ To overcome the disadvantages of the traditional TFIDF algorithm, this paper proposes a new method called TFIDFIGE, in which information gain and entropy are two important factors. Compared with the traditional TFIDF algorithm, the proposed method considers how feature words are distributed among classes and inside each class, which improves the accuracy of term weight calculation. In addition, by removing isolated points, the method effectively reduces the dimension of the text vector and thereby reduces the time and space complexity of text classification.

Finally, text data were downloaded from NetEase News, Sina News and Phoenix News with a web crawler, and 7,700 texts were selected at random as the experimental data set. Comparison experiments were then carried out with KNN and Naive Bayes classifiers using three term weighting approaches: TFIDF, TFIDFIG and TFIDFIGE. The experimental results show that the method proposed in this paper overcomes the disadvantages of the traditional TFIDF and performs better than the other two in the precision, recall and F-measure of text classification.
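The abstract names its building blocks (TFIDF, information gain, information entropy) but does not state the TFIDFIGE formula, so the following is only a minimal sketch of the standard textbook definitions of those three quantities, not the thesis's method. The function names, the toy corpus and the choice of logarithm bases are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens with a class label.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "goal"],
    ["stock", "market", "team"],
    ["market", "stock", "price"],
]
labels = ["sports", "sports", "finance", "finance"]


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def tf_idf(term, doc, docs):
    """Classic TF-IDF: relative term frequency times log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0


def class_entropy(subset_labels):
    """Entropy of the class distribution within a set of documents."""
    if not subset_labels:
        return 0.0
    total = len(subset_labels)
    return entropy(c / total for c in Counter(subset_labels).values())


def information_gain(term, docs, labels):
    """IG(t) = H(C) - P(t) * H(C | t present) - P(not t) * H(C | t absent)."""
    n = len(docs)
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    return (
        class_entropy(labels)
        - len(with_t) / n * class_entropy(with_t)
        - len(without_t) / n * class_entropy(without_t)
    )


def class_distribution_entropy(term, docs, labels):
    """Entropy of a term's occurrence counts across classes (0 = one class only)."""
    counts = Counter()
    for d, y in zip(docs, labels):
        counts[y] += d.count(term)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return entropy(c / total for c in counts.values())
```

On this toy corpus, "goal" appears only in the sports documents, so its information gain is maximal and its class-distribution entropy is zero, while "team" is spread over both classes and scores a lower gain. How the thesis combines such factors into a single weight is not given in the abstract.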
Keywords/Search Tags: TFIDF, text classification, term weight calculation, information entropy, information gain