
Context Semantic-based Adaboost-NB For Text Classification

Posted on: 2019-02-01    Degree: Master    Type: Thesis
Country: China    Candidate: K Y Zheng    Full Text: PDF
GTID: 2428330593950516    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet and information technology, portal sites store data in many categories, and the volume of that data grows exponentially. Most of the information stored on these websites is text, so mining information from text and advancing automatic text classification have become among the most popular topics in data mining. Common classification approaches rely on a single classifier, which has reached a bottleneck and offers little room for further breakthroughs. Among these algorithms, Naive Bayes tends to produce low variance but high bias because of its simple structure. Traditional Naive Bayes assumes that the features of a sample are mutually independent, a condition rarely satisfied in the real world. The algorithm also fails to account for synonyms in text classification: it trains synonyms as independent words, which lowers their weights and hurts the classifier's accuracy. It builds text features by extracting high-weight words as feature vectors, and because it ignores word order and the semantic features of the text, much useful information is lost. When classifying long texts, the feature vectors are usually high-dimensional and sparse, and the algorithm cannot make good use of the corpus when many low-frequency words appear, which reduces classification efficiency. In specific business domains, the small size and uneven distribution of the training data make the learned distribution unreliable.

To address these problems of the single classifier, this thesis proposes AdaBoost-NB, which combines a boosting algorithm with Naive Bayes to optimize and improve classifier performance. To improve the boosting algorithm AdaBoost, a new weight-allocation method is proposed: the weight of each base classifier is computed from its error rate, and in each training round all misclassified instances have their weights lifted by the same factor. The new weight allocation is derived from the misclassification rate of each instance, so instances with a high error rate receive larger weights and those with a low error rate receive smaller ones. Lifting the sample weights in this way increases the diversity of the base classifiers, so that each base classifier is better suited to its training data; such classifiers reduce the number of training iterations and the influence of noise, which improves classification performance.

The second part of this thesis proposes a Context Semantic-based Naive Bayesian text classifier (CSNB), which addresses Naive Bayes's neglect of synonyms, uneven training samples, and the high-dimensional, sparse distribution of text. Because synonym lists must otherwise be defined and maintained manually at great cost in time, this thesis first extends the concept of synonyms to "similar words": words that are most likely to appear in the same context. Similar words are obtained dynamically by training a language model, and they participate in training the classification model, so that the model incorporates contextual semantic knowledge. In the implementation, the neural probabilistic language model word2vec is used to obtain similar words; training the word2vec model and training the classification model use two separate corpora. First, word vectors are obtained by training word2vec on a large collection of related corpora. Then a feature dictionary is built by analysing the word vectors, and similar-word clusters are produced with a semantic clustering algorithm. Finally, the classification model is trained with the similar words merged.

The clustering algorithm is implemented by combining it with the hierarchical clustering method for semantic cluster selection proposed above; it uses the word2vec training results to build the word-cluster dictionary dynamically, which alleviates the problems caused by text sparsity. The third part of the thesis combines the results of the first two parts. It proposes a training model called the Context Semantic-based AdaBoost-NB text classifier (CSAda-NB), which fully considers the limitations of a single classifier, the text-sparsity problem, and Naive Bayes's neglect of synonyms. This model combines AdaBoost-NB with contextual semantic factors. In testing, its F1 value improves by 5% over that of AdaBoost-NB. The model improves classification accuracy by adding semantic factors, and a comparison of the standard deviations of CSAda-NB and AdaBoost-NB shows that analysing a large number of related corpora reduces the instability of the classification results.
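The boosting scheme described above can be sketched in a few lines. The toy below uses a trivial threshold stub as the base learner in place of the thesis's Naive Bayes classifier, and the data are invented for illustration; it shows the two mechanisms the abstract names, computing each base classifier's weight from its error rate and lifting the weights of misclassified samples each round.

```python
import math

# Toy sketch of AdaBoost with error-rate-based weight allocation.
# The base learner is a one-dimensional threshold stub standing in for
# the Naive Bayes classifier used in the thesis (hypothetical data/learner).

def train_adaboost(X, y, n_rounds=3):
    n = len(X)
    w = [1.0 / n] * n                        # uniform initial sample weights
    ensemble = []                            # (alpha, threshold) pairs
    for _ in range(n_rounds):
        # pick the threshold stub with the lowest weighted error
        best = None
        for t in sorted(set(X)):
            pred = [1 if x >= t else -1 for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, t, pred)
        err, t, pred = best
        err = max(err, 1e-10)                # avoid division by zero
        # classifier weight computed from its error rate
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t))
        # lift the weights of misclassified samples, shrink correct ones
        w = [wi * math.exp(-alpha * p * yi) for wi, p, yi in zip(w, pred, y)]
        s = sum(w)
        w = [wi / s for wi in w]             # renormalize to a distribution
    return ensemble

def predict(ensemble, x):
    score = sum(a * (1 if x >= t else -1) for a, t in ensemble)
    return 1 if score >= 0 else -1

X = [1, 2, 3, 6, 7, 8]
y = [-1, -1, -1, 1, 1, 1]
model = train_adaboost(X, y)
print([predict(model, x) for x in X])        # prints [-1, -1, -1, 1, 1, 1]
```

The thesis's variant differs from this textbook update in how exactly the lifting factor is derived from the per-instance misclassification rate; the sketch only illustrates the general mechanism.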
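The similar-word step can likewise be sketched. Real word2vec vectors would come from training on a large corpus; here a few hand-made three-dimensional vectors and a 0.9 similarity threshold stand in (both are illustrative assumptions), and a simple greedy grouping plays the role of the thesis's hierarchical semantic clustering, mapping every word onto a cluster representative so the classifier treats similar words as one feature.

```python
import math

# Toy word vectors standing in for word2vec output (invented values).
vectors = {
    "movie": [0.9, 0.1, 0.0],
    "film":  [0.88, 0.12, 0.05],
    "stock": [0.05, 0.9, 0.2],
    "share": [0.1, 0.85, 0.25],
    "goal":  [0.0, 0.2, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_clusters(vectors, threshold=0.9):
    """Greedy single-pass grouping: a word joins the first cluster whose
    representative it is similar enough to, otherwise it starts a new one.
    A stand-in for the hierarchical semantic clustering in the thesis."""
    clusters = []                        # list of (representative, members)
    for word, vec in vectors.items():
        for rep, members in clusters:
            if cosine(vectors[rep], vec) >= threshold:
                members.append(word)
                break
        else:
            clusters.append((word, [word]))
    # map every word onto its cluster representative
    return {w: rep for rep, members in clusters for w in members}

word_to_rep = build_clusters(vectors)
# merge similar words in a document before classifier training
doc = ["film", "share", "movie"]
print([word_to_rep[w] for w in doc])     # prints ['movie', 'stock', 'movie']
```

After this merge, "film" and "movie" contribute to the same feature count, which is how the merging raises the effective weight of synonym groups and reduces feature sparsity.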
Keywords/Search Tags: Naive Bayes, AdaBoost, word2vec, similar word