Forum Data Extraction Based On Similarity Calculation

Posted on: 2021-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2438330626454525
Subject: Computer application technology
Abstract/Summary:
Internet forums have become an important source of information in recent years. They offer a large volume and wide variety of valuable information and references, and have become an indispensable part of social life. However, the content posted on forums mixes the genuine with the false: it includes articles on quite different themes and annoying advertisements, which leads to low user satisfaction. Classifying forum data is therefore a quick way to reach the right information. This thesis categorizes forum articles through segmentation of the article content, keyword extraction, and similarity calculation, so that users receive more accurate data and more valuable information.

The main contribution of this thesis is a forum data classification method based on a cross-validated Bayesian model, referred to as the "cross validation Bayes classification method". The method introduces the concept of N-Gram and combines it with the TF-IDF algorithm to extract keywords. A Naive Bayes algorithm optimized by cross validation is then used in the similarity calculation to classify the forum articles. Combining N-Gram with TF-IDF for keyword extraction not only extracts the keywords of an article but also retains the positional information of the lexical features. The cross validation step takes into account how the lexical features are distributed across the text set, so that the Bayesian model fits the actual results better. In theory, the cross validation Bayes classification method should therefore produce more accurate classification results.

The results of text classification with the cross validation Bayesian model are compared with those of text classification based on logistic regression, and the comparison confirms the practicality and effectiveness of the proposed method. The effectiveness of the method is verified by tests on offline forum text data.
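The pipeline described in the abstract (N-Gram features, TF-IDF keyword ranking, Naive Bayes, cross validation) can be illustrated with a minimal sketch. This is not the thesis's implementation. Everything below is an assumption made for illustration: the whitespace tokenizer (real forum text would need proper segmentation), the toy documents and labels, the use of unigrams plus bigrams, the smoothed IDF formula, and the choice of the Laplace smoothing parameter `alpha` as the quantity that cross validation tunes. The abstract does not say which of these the author used.

```python
import math
from collections import Counter, defaultdict

def ngrams(tokens, n_max=2):
    """Contiguous word n-grams for n = 1..n_max; bigrams keep local word order."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_keywords(docs, top_k=5, n_max=2):
    """Rank each document's n-grams by TF-IDF and keep the top_k as keywords."""
    grams = [ngrams(d.lower().split(), n_max) for d in docs]
    df = Counter(g for doc in grams for g in set(doc))  # document frequency
    keywords = []
    for doc in grams:
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
        score = {g: (c / len(doc)) * (math.log((1 + len(docs)) / (1 + df[g])) + 1.0)
                 for g, c in tf.items()}
        ranked = sorted(score, key=lambda g: (-score[g], g))  # ties broken by name
        keywords.append(ranked[:top_k])
    return keywords

def train_nb(features, labels, alpha=1.0):
    """Multinomial Naive Bayes counts; alpha is the Laplace smoothing constant."""
    class_counts = Counter(labels)
    gram_counts = defaultdict(Counter)
    vocab = set()
    for feats, y in zip(features, labels):
        gram_counts[y].update(feats)
        vocab.update(feats)
    return class_counts, gram_counts, vocab, alpha

def predict_nb(model, feats):
    """Return the class with the highest log posterior; unseen n-grams are skipped."""
    class_counts, gram_counts, vocab, alpha = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for y, cnt in class_counts.items():
        lp = math.log(cnt / total)
        denom = sum(gram_counts[y].values()) + alpha * len(vocab)
        for g in feats:
            if g in vocab:
                lp += math.log((gram_counts[y][g] + alpha) / denom)
        if lp > best_lp:
            best, best_lp = y, lp
    return best

def cross_val_accuracy(features, labels, k=3, alpha=1.0):
    """Mean accuracy over k folds; fold f holds out items f, f + k, f + 2k, ..."""
    accs = []
    for fold in range(k):
        train = [j for j in range(len(labels)) if j % k != fold]
        test = [j for j in range(len(labels)) if j % k == fold]
        model = train_nb([features[j] for j in train],
                         [labels[j] for j in train], alpha)
        hits = sum(predict_nb(model, features[j]) == labels[j] for j in test)
        accs.append(hits / len(test))
    return sum(accs) / k

# Hypothetical toy forum posts (not data from the thesis).
docs = ["python list sort error help", "buy cheap watches discount sale",
        "python dict key error traceback", "cheap discount shoes sale today",
        "sort python list by key", "buy shoes sale cheap price"]
labels = ["tech", "ad", "tech", "ad", "tech", "ad"]

keywords = tfidf_keywords(docs, top_k=6)
for alpha in (0.1, 1.0, 5.0):  # candidate smoothing values to compare
    print(alpha, cross_val_accuracy(keywords, labels, k=3, alpha=alpha))
```

In this sketch the value of `alpha` with the highest mean fold accuracy would be kept for the final model. On six toy posts the scores mean little; the point is only the shape of the pipeline, in which keywords extracted from N-Grams feed a Bayesian classifier whose setting is chosen on held-out folds.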
Keywords/Search Tags: text classification, N-gram, keyword extraction, similarity calculation, cross validation