
Research On Text Classification Based On Improved Information Gain And LDA

Posted on: 2019-08-20   Degree: Master   Type: Thesis
Country: China   Candidate: F Z Zhang   Full Text: PDF
GTID: 2428330548456877   Subject: Engineering
Abstract/Summary:
As the Internet becomes more and more pervasive, how to obtain target information quickly and effectively from massive amounts of data has become a central concern. Although network information takes many forms, such as images, audio, and video, roughly 80% of it is still presented as text, so efficient management of text information is the key to coping with this flood of data. Text categorization, as an efficient way to manage text information, has long been a research hotspot in the field of data mining. Text data is characterized by high dimensionality, high dispersion, and sparseness. Chinese adds further difficulty: words are often polysemous or synonymous, and unlike English, Chinese has no natural word boundaries, so the quality of the word-segmentation algorithm strongly influences the final classification result. Chinese text classification is therefore especially complex, and these factors pose serious challenges to classification accuracy. Improving the efficiency of feature selection has thus become the key to improving the text classification effect.

Traditional information gain algorithms suffer from two drawbacks. First, because they ignore the correlation between a feature word's term-frequency information and its category information, the computed information gain can be inaccurate. Second, because they are purely statistical, they ignore the correlations among feature words and thus fail to capture semantic information. This thesis proposes an information gain algorithm based on class information and combines it with the LDA topic model to address both problems.

For the first drawback, the concept of class is analyzed from both the between-class and within-class perspectives, and four quantities, between-class term frequency, between-class dispersity, intra-class term frequency, and intra-class polymerization, are incorporated into the information gain formula. Between-class term frequency and between-class dispersity describe how a feature word is distributed across the classes, reflecting its dispersion between classes; they mainly measure how representative the word is of particular classes. Intra-class term frequency and intra-class polymerization describe how the word is distributed over the individual documents within a given class; they mainly measure how representative the word is of that class.

For the second drawback, the LDA topic model is adopted. LDA expresses correlations between feature words through shared topics, so semantically similar feature words are linked together while redundant information such as synonyms is removed. Finally, classification is performed with an SVM classifier, and four feature selection schemes are compared: the traditional information gain algorithm, the information gain algorithm based on class information, the LDA topic model alone, and the combination of the class-information-based information gain algorithm and the LDA topic model. Experiments show that the proposed combination of the class-information-based information gain algorithm and the LDA topic model improves the classification effect.
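The traditional information gain score that the thesis takes as its baseline can be sketched as follows. This is a minimal illustration of the standard entropy-based formula IG(t) = H(C) - [P(t)·H(C|t) + P(¬t)·H(C|¬t)], not the thesis's modified version; the toy corpus and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a class-label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) * H(C|t) + P(not t) * H(C|not t)].

    `docs` is a list of token sets; `labels` is the parallel list of class labels.
    """
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional

# Hypothetical toy corpus, for illustration only.
docs = [{"the", "ball", "game"}, {"the", "ball", "team"},
        {"the", "stock", "market"}, {"the", "bank", "market"}]
labels = ["sport", "sport", "finance", "finance"]

# "ball" perfectly separates the two classes; "the" appears everywhere.
print(information_gain(docs, labels, "ball"))  # 1.0
print(information_gain(docs, labels, "the"))   # 0.0
```

Note that this score uses only document-level presence/absence of a term, which is exactly the first drawback the thesis targets: it ignores how the term's frequency is distributed between and within classes.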
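The overall LDA-plus-SVM classification pipeline described above can be sketched with scikit-learn; the abstract gives no implementation details, so the component choices (`CountVectorizer`, `LatentDirichletAllocation`, `LinearSVC`) and parameters such as `n_components` are assumptions, and the documents are hypothetical pre-segmented text (real Chinese text would first be word-segmented, since Chinese has no natural word boundaries).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical, already-segmented documents (illustration only).
docs = ["ball game team win", "ball team score game",
        "stock market bank trade", "bank market trade price"]
labels = ["sport", "sport", "finance", "finance"]

# Bag-of-words counts -> LDA document-topic proportions -> linear SVM.
pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LinearSVC(),
)
pipeline.fit(docs, labels)
predictions = pipeline.predict(docs)
```

Passing documents through the LDA step means the SVM sees topic proportions rather than raw term counts, which is how semantically related feature words end up linked through shared topics.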
Keywords/Search Tags:Text classification, Improved information gain algorithm, LDA topic model, Classifier