
Research Of Chinese Text Classification Algorithms Based On VSM

Posted on: 2019-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: W Z Yu
Full Text: PDF
GTID: 2428330566995919
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of network technology, many information resources now appear in the form of text, and people expect to obtain useful information quickly and accurately by processing text data. Text classification, as an important way of processing documents, plays an important role in information processing. In recent years, as machine learning research has advanced, text classification techniques have developed rapidly. However, accurately classifying documents within a large collection of text is not as simple as it might seem: it generally requires a complex processing pipeline, including feature extraction, feature weighting, and classifier training. The algorithms used in these steps still leave many aspects to be improved. This thesis studies and improves several algorithms for feature extraction, feature weighting, and classification. The main research contents are as follows:

(1) To address the problem that Chinese feature words are not taken into account in feature extraction, this thesis proposes an improved information gain algorithm based on information entropy weights. The algorithm uses information entropy to measure the distribution of Chinese feature words among classes and assigns corresponding weights to the information gain of each class. Simulation results show that the improved algorithm achieves good classification performance.

(2) To address the fact that the traditional TF-IDF feature weighting algorithm does not consider how the distribution of features among classes affects classification, a new weighting algorithm, TF-IDF-ICL (term frequency, inverse document frequency, inter-class content), is proposed based on the concept of the moment of gravity from physics. Simulation results show that the proposed algorithm effectively improves the accuracy and recall of text classification.

(3) To address the fact that the attribute independence assumption of naive Bayes does not match objective reality, this thesis proposes a naive Bayes text classification algorithm based on mutual information weighting. The method uses mutual information to weight the feature words in different categories, which partly offsets the effect of the independence assumption on classification. Simulation results show that the improved algorithm achieves good classification performance.
Keywords/Search Tags:text classification, information gain, information entropy, moment of gravity, TF-IDF, mutual information, naive Bayes classification
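
As a rough illustration of two baseline quantities named in the abstract, the sketch below computes the standard information gain of a term and the entropy of that term's distribution across classes. It is not the improved algorithm proposed in the thesis, whose weighting formula is not given here. The function names, the toy corpus, and the representation of each document as a set of already segmented tokens are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def class_entropy(labels):
    """Entropy of the class distribution in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return entropy(c / total for c in counts.values())


def information_gain(docs, labels, term):
    """Standard information gain: H(C) - H(C | term present or absent)."""
    n = len(docs)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    h_conditional = sum(
        len(subset) / n * class_entropy(subset)
        for subset in (with_term, without_term)
        if subset  # skip an empty split to avoid dividing by zero
    )
    return class_entropy(labels) - h_conditional


def class_distribution_entropy(docs, labels, term):
    """Entropy of how documents containing the term spread over the classes.

    Low entropy means the term is concentrated in few classes.
    """
    counts = Counter(y for d, y in zip(docs, labels) if term in d)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return entropy(c / total for c in counts.values())


# Hypothetical toy corpus: each document is a set of segmented tokens.
docs = [
    {"stock", "market"},
    {"stock", "fund"},
    {"football", "match"},
    {"football", "market"},
]
labels = ["finance", "finance", "sports", "sports"]

for term in ("stock", "market"):
    print(term,
          information_gain(docs, labels, term),
          class_distribution_entropy(docs, labels, term))
```

In this toy corpus, "stock" occurs only in the finance documents, so it has high information gain and zero class-distribution entropy, while "market" is split evenly between the two classes and carries no information about the class. A class-aware weighting scheme of the kind the abstract describes would presumably build on this kind of distribution measure, but the exact form is left to the full text.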