A Study Of Text Character Extraction And Classification Technology For Forum

Posted on: 2016-02-01 | Degree: Master | Type: Thesis
Country: China | Candidate: L Xiao | Full Text: PDF
GTID: 2308330479477734 | Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the Internet, online forums have become a kind of "newspaper" for the public. Active communities, rumors, headlines and the like have become the seeds of information dissemination in this era. Given the freedom of cyberspace, managing this information effectively through feature extraction and classification is an important aspect of public opinion monitoring and a necessary means of guiding the healthy development of society. Supported by the Hebei opinion monitoring project, this thesis targets the content of the widely followed Shijiazhuang forum section "said Shi Shi words" and carries out data acquisition, feature extraction and classification to identify the topics people are most concerned about. The concrete steps are as follows.

First, considering the complex and disordered nature of websites and their structural features, a data downloader is designed using web crawler technology. The original data set is then obtained through removal of common Chinese junk text, word segmentation, and removal of high- and low-frequency words.

Then, to represent the text data set effectively, the thesis adopts the classical LDA topic model. In the feature extraction process, it addresses three problems in text feature selection: the high dimensionality of word features, repeated topic features, and classification accuracy that fluctuates across data sets. To handle these, it uses a topic feature selection algorithm based on semantic dimension reduction. The algorithm fuses the probabilistic correlation between words with their semantic correlation, combining linguistic and statistical knowledge, so that the selected topic features are more representative of the document.

Finally, feature vectors are extracted from the texts, a similarity threshold is set, and the nearest neighbor method is used to classify the data and count hot spots.
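
The preprocessing and classification steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the document-frequency cutoffs, the 0.8 threshold, and the choice of cosine similarity are all assumptions made for illustration, and the topic vectors are written by hand rather than produced by an LDA model.

```python
import math
from collections import Counter

def filter_by_frequency(docs, min_df=2, max_df_ratio=0.8):
    """Drop words found in too few or too many documents (cutoffs are assumed)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    keep = {w for w, c in df.items() if c >= min_df and c / n_docs <= max_df_ratio}
    return [[w for w in doc if w in keep] for doc in docs]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(vec, labeled, threshold=0.8):
    """Return the label of the nearest labeled vector, or None below the threshold."""
    best_label, best_sim = None, -1.0
    for ref_vec, label in labeled:
        sim = cosine(vec, ref_vec)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else None

def hot_topics(vectors, labeled, threshold=0.8):
    """Count how many posts fall into each label, most frequent first."""
    counts = Counter(classify(v, labeled, threshold) for v in vectors)
    counts.pop(None, None)  # posts that matched no label are not counted
    return counts.most_common()

# Hypothetical segmented posts and hand-written topic vectors.
docs = [["the", "traffic", "jam"], ["the", "traffic", "school"],
        ["the", "school", "fee"], ["the", "rare"]]
labeled = [([0.9, 0.1, 0.0], "traffic"), ([0.1, 0.9, 0.0], "education")]
posts = [[0.8, 0.2, 0.0], [0.7, 0.3, 0.0], [0.2, 0.8, 0.0], [0.0, 0.0, 1.0]]

filtered = filter_by_frequency(docs)
ranking = hot_topics(posts, labeled)
```

Here a post whose best similarity falls below the threshold is left unlabeled rather than forced into the closest class, which is one plausible reading of "set the similarity threshold" in the abstract.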
Keywords/Search Tags: LDA, topic model, topic features, K-nearest neighbors