Research Of Chinese Text Classification Algorithms Based On VSM

Posted on:2019-04-13

Degree:Master

Type:Thesis

Country:China

Candidate:W Z Yu

GTID:2428330566995919

Subject:Signal and Information Processing

Abstract/Summary:

With the rapid development of network technology, many information resources appear in the form of text. People expect to obtain useful information quickly and accurately by processing text data. Text classification, as an important way to process documents, plays an important role in information processing. In recent years, with further research into machine learning, text classification technologies have developed rapidly. However, accurately classifying text within a large collection is not as simple as it might seem: it generally requires a complex text processing pipeline, including feature extraction, feature weighting, and classifier training. The algorithms used in these steps still leave much room for improvement. In this thesis, several algorithms for feature extraction, feature weighting, and classification are studied and improved. The main research contents are as follows:

(1) To address the problem that Chinese feature words are not considered in feature extraction, an improved information gain algorithm based on information entropy weighting is proposed. The algorithm uses information entropy to measure the distribution of Chinese feature words among classes and assigns corresponding weights to the information gain of each class. Simulation results show that the improved algorithm achieves good classification performance.

(2) Because the traditional TF-IDF feature weighting algorithm does not consider how the distribution of features among classes affects classification, a new algorithm, TF-IDF-ICL (term frequency inverse-document frequency inter-class content), is proposed based on the physical concept of the moment of gravity. Simulation results show that the proposed algorithm effectively improves the accuracy and recall rate of text classification.

(3) Because the attribute independence assumption of naive Bayes does not match objective reality, this thesis proposes a naive Bayes text classification algorithm based on mutual information weighting. The method uses mutual information to weight the feature words in different categories, which partly offsets the influence of the independence assumption on classification. Simulation results show that the improved algorithm achieves good classification performance.
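The abstract does not give the formulas for the three improved algorithms, so they are not reproduced here. As a rough illustration only, the following Python sketch computes the standard baseline quantities those algorithms start from: class entropy, information gain, TF-IDF, and pointwise mutual information. The function names, the toy corpus, and the assumption that documents arrive as lists of already-segmented words are all hypothetical and do not come from the thesis.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def class_entropy(labels):
    """Entropy of the class distribution over a list of labels."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    return entropy(c / len(labels) for c in counts.values())


def information_gain(term, docs, labels):
    """Baseline IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t)."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    return (class_entropy(labels)
            - len(with_term) / n * class_entropy(with_term)
            - len(without_term) / n * class_entropy(without_term))


def tf_idf(term, doc, docs):
    """Baseline TF-IDF: relative term frequency times ln(N / df)."""
    df = sum(1 for d in docs if term in d)
    if df == 0 or not doc:
        return 0.0
    return doc.count(term) / len(doc) * math.log(len(docs) / df)


def mutual_information(term, cls, docs, labels):
    """Pointwise MI log2(P(t, c) / (P(t) P(c))), estimated from document counts.

    Returns 0.0 when the term never co-occurs with the class; this
    fallback is a choice made for the sketch, not part of the source.
    """
    n = len(labels)
    joint = sum(1 for d, y in zip(docs, labels) if term in d and y == cls)
    if joint == 0:
        return 0.0
    p_term = sum(1 for d in docs if term in d) / n
    p_cls = labels.count(cls) / n
    return math.log2((joint / n) / (p_term * p_cls))


# Toy corpus: pinyin tokens stand in for segmented Chinese words (assumption).
docs = [["zuqiu", "bisai"], ["zuqiu", "jinqiu"],
        ["gupiao", "shangzhang"], ["gupiao", "bisai"]]
labels = ["sports", "sports", "finance", "finance"]

print(information_gain("zuqiu", docs, labels))   # term confined to one class
print(information_gain("bisai", docs, labels))   # term spread evenly over classes
print(tf_idf("zuqiu", docs[0], docs))
print(mutual_information("zuqiu", "sports", docs, labels))
```

In this toy corpus, a term confined to one class receives the maximum information gain, while a term spread evenly across classes receives none. That gap reflects the between-class distribution information that the abstract says baseline TF-IDF ignores.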

Keywords/Search Tags:

text classification, information gain, information entropy, moment of gravity, TF-IDF, mutual information, naive Bayes classification

Related items

1	The Research Of Multi-layer Hidden Naive Bayes Algorithm Based On Mutual Information
2	The Research And Implement Of Naive Bayes Text Classification Algorithm
3	Research On Text Classification Algorithms Based On Machine Learning
4	Research And Improvement To Text Classification Algorithm
5	Research Of Improved Mutual Information-Based Naive Bayesian Classification Model
6	Research On Term Weighting Approach Based On Information Gain And Entropy
7	Study Of Spam-filtering Based On Text Classification
8	The Research And Implementation Of Text Classification Based On Meta-Information And Optimization
9	The Research And Implementation Of Text Classification Based On Meta-information And Optimization
10	Research On Text Classification Algorithm Based On Naive Bayes Method