
Research On Extraction And Recognition Of Internet Bad Information Collection

Posted on: 2017-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: H L Yu
Full Text: PDF
GTID: 2278330488465635
Subject: Computer technology
Abstract/Summary:
With the rapid development of forums, communities, Weibo and other social networks in recent years, Internet users can publish and spread information far faster than before. Users can freely express political views on social networks, comment on emergencies, and monitor public events. This freedom of publication has also led to pornography, violence, gambling, and even reactionary remarks appearing on the web. As the Internet develops rapidly, such bad information shows a spreading trend and has caused great public concern. How to strengthen the ability to identify inappropriate web content, and how to improve the quality and efficiency of public opinion monitoring, have therefore become important problems for technical workers to solve.

The research in this thesis is supported by the "Internet bad information monitoring and management platform" project. It starts from the practical needs of the project while also pursuing theoretical innovation. After a brief analysis of the platform and of the research status and open problems in web crawling, information extraction, and bad information identification, the thesis designs and implements a scheme for extracting and identifying bad information in content collected from forums and Weibo. It improves word2vec-based word expansion by introducing hidden Markov word polarity labeling, uses the improved method to expand the basic sensitive lexicon into a set of feature words, and evaluates and verifies the accuracy and feasibility of the method.

For collection and extraction, the thesis analyzes the structure of forums and Weibo and designs a page acquisition scheme for them. The scheme flexibly applies targeted extraction to each of the different "elements" to be parsed, and it supports visual configuration of the page crawling and extraction settings. When a site is parsed, the text of each "element" on its pages is extracted according to that element's extraction rules and packaged into a standardized document. Experiments show that the method extracts page information quickly and conveniently according to the defined rules, with good accuracy and recall.

For bad information identification, the thesis uses word2vec to expand the basic sensitive lexicon with related words, introduces hidden Markov word polarity labeling, and filters out "deviating words" to form the feature word set. Feature values are then computed from the feature word weights and combined with an SVM classifier to identify bad information, which gives good recognition results.
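The lexicon expansion and feature word weighting described above can be illustrated with a minimal sketch. It is not the thesis's implementation, and everything in it is an assumption made for illustration: the tiny hand-written vectors stand in for trained word2vec embeddings, the names `TOY_VECTORS`, `expand_lexicon` and `feature_score`, the similarity threshold, and the use of similarity as the word weight are all hypothetical. The hidden Markov polarity labeling step and the SVM classifier are left out.

```python
import math

# Hypothetical 3-dimensional vectors standing in for trained word2vec embeddings.
TOY_VECTORS = {
    "gambling": (0.90, 0.10, 0.00),
    "casino":   (0.80, 0.20, 0.10),
    "betting":  (0.85, 0.15, 0.05),
    "weather":  (0.00, 0.10, 0.90),
    "forecast": (0.05, 0.20, 0.85),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_lexicon(seeds, vectors, threshold=0.95):
    """Expand a seed lexicon with words whose vectors lie close to a seed.

    Returns a dict mapping each feature word to a weight: 1.0 for seed
    words, and the best cosine similarity to any seed for added words.
    Words below the (assumed) threshold are dropped.
    """
    known_seeds = [s for s in seeds if s in vectors]
    weights = {s: 1.0 for s in known_seeds}
    for word, vec in vectors.items():
        if word in weights:
            continue
        best = max((cosine(vec, vectors[s]) for s in known_seeds), default=0.0)
        if best >= threshold:
            weights[word] = best
    return weights


def feature_score(tokens, weights):
    """Sum the weights of the feature words that occur in a token list."""
    return sum(weights.get(token, 0.0) for token in tokens)


feature_words = expand_lexicon(["gambling"], TOY_VECTORS)
print(sorted(feature_words))
print(feature_score(["casino", "weather"], feature_words))
```

In a fuller pipeline, a score like this would be one input feature among others for the classifier rather than a decision on its own.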
Keywords/Search Tags: information crawling, information extraction, bad information identification, word2vec