
Research On Text Classification Based On Improved Information Gain And LDA

Posted on: 2019-08-20   Degree: Master   Type: Thesis
Country: China   Candidate: F Z Zhang   Full Text: PDF
GTID: 2428330548456877   Subject: Engineering
Abstract/Summary:
As the Internet becomes more and more pervasive, how to obtain target information quickly and effectively from massive amounts of data has become a central concern. Although network information takes many forms, such as images, audio, and video, roughly 80% of it is still presented as text, so efficient management of text information is the key to coping with this flood of data. Text categorization, as an efficient way to manage text information, has long been a research hotspot in the field of data mining. Text data is characterized by high dimensionality, high dispersion, and sparseness. Chinese adds further difficulty: words are often polysemous or synonymous, and unlike English, Chinese has no natural word boundaries, so the quality of the word-segmentation algorithm strongly influences the final classification result. Chinese text classification is therefore especially complex, and these factors pose serious challenges to classification accuracy. Improving the efficiency of feature selection has thus become the key to improving the text classification effect.

Traditional information gain algorithms suffer from two drawbacks. First, because they ignore the correlation between a feature word's term-frequency information and its category information, the computed information gain can be inaccurate. Second, because they are purely statistical, they ignore the correlations among feature words and thus fail to capture semantic information. This thesis proposes an information gain algorithm based on class information and combines it with the LDA topic model to address both problems.

For the first drawback, the concept of class is analyzed from both the between-class and within-class perspectives, and four quantities, between-class term frequency, between-class dispersity, intra-class term frequency, and intra-class polymerization, are incorporated into the information gain formula. Between-class term frequency and between-class dispersity describe how a feature word is distributed across the classes, reflecting its dispersion between classes; they mainly measure how representative the word is of particular classes. Intra-class term frequency and intra-class polymerization describe how the word is distributed over the individual documents within a given class; they mainly measure how representative the word is of that class.

For the second drawback, the LDA topic model is adopted. LDA expresses correlations between feature words through shared topics, so semantically similar feature words are linked together while redundant information such as synonyms is removed. Finally, classification is performed with an SVM classifier, and four feature selection schemes are compared: the traditional information gain algorithm, the information gain algorithm based on class information, the LDA topic model alone, and the combination of the class-information-based information gain algorithm and the LDA topic model. Experiments show that the proposed combination of the class-information-based information gain algorithm and the LDA topic model improves the classification effect.
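The traditional information gain score that the thesis takes as its baseline can be sketched as follows. This is a minimal illustration of the standard entropy-based formula IG(t) = H(C) - [P(t)·H(C|t) + P(¬t)·H(C|¬t)], not the thesis's modified version; the toy corpus and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a class-label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) * H(C|t) + P(not t) * H(C|not t)].

    `docs` is a list of token sets; `labels` is the parallel list of class labels.
    """
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional

# Hypothetical toy corpus, for illustration only.
docs = [{"the", "ball", "game"}, {"the", "ball", "team"},
        {"the", "stock", "market"}, {"the", "bank", "market"}]
labels = ["sport", "sport", "finance", "finance"]

# "ball" perfectly separates the two classes; "the" appears everywhere.
print(information_gain(docs, labels, "ball"))  # 1.0
print(information_gain(docs, labels, "the"))   # 0.0
```

Note that this score uses only document-level presence/absence of a term, which is exactly the first drawback the thesis targets: it ignores how the term's frequency is distributed between and within classes.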
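The overall LDA-plus-SVM classification pipeline described above can be sketched with scikit-learn; the abstract gives no implementation details, so the component choices (`CountVectorizer`, `LatentDirichletAllocation`, `LinearSVC`) and parameters such as `n_components` are assumptions, and the documents are hypothetical pre-segmented text (real Chinese text would first be word-segmented, since Chinese has no natural word boundaries).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical, already-segmented documents (illustration only).
docs = ["ball game team win", "ball team score game",
        "stock market bank trade", "bank market trade price"]
labels = ["sport", "sport", "finance", "finance"]

# Bag-of-words counts -> LDA document-topic proportions -> linear SVM.
pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LinearSVC(),
)
pipeline.fit(docs, labels)
predictions = pipeline.predict(docs)
```

Passing documents through the LDA step means the SVM sees topic proportions rather than raw term counts, which is how semantically related feature words end up linked through shared topics.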
Keywords/Search Tags:Text classification, Improved information gain algorithm, LDA topic model, Classifier