Research On Extraction And Recognition Of Internet Bad Information Collection

Posted on:2017-04-28

Degree:Master

Type:Thesis

Country:China

Candidate:H L Yu

Full Text:PDF

GTID:2278330488465635

Subject:Computer technology

Abstract/Summary:

In recent years, the rapid growth of forums, online communities, Weibo, and other social networks has accelerated the speed at which Internet users publish and spread information. Users can freely express political views on social networks, comment on emergencies, and scrutinize public events. This openness has also allowed pornography, violence, gambling, and even reactionary remarks to appear online. Such undesirable information spreads ever more widely as the Internet develops and has caused great public concern. How to strengthen the identification of inappropriate web content and improve the quality and efficiency of public opinion monitoring has therefore become an important problem for technical workers.

This thesis is supported by the "Internet bad information monitoring and management platform" project. It starts from the practical needs of that project while also pursuing theoretical innovation. After a brief analysis of the platform and of the research status and open problems in web crawling, information extraction, and bad information identification, the thesis designs and implements a scheme for extracting and identifying bad information based on information collected from forums and Weibo. Word2vec-based word expansion is improved by introducing hidden Markov word polarity labeling, the basic sensitive lexicon is expanded into a set of feature words, and the accuracy and feasibility of the method are evaluated and verified.

For collection and extraction, the thesis analyzes the structure of forum and Weibo pages and designs a page acquisition program for them. The program flexibly performs targeted extraction for each "element" to be parsed and provides visual configuration of the page crawling and extraction settings. When parsing a site, it applies the extraction rule of each "element" to the site's pages, extracts the corresponding text, and packages the result into a standardized document. Experiments show that the method extracts page information quickly and conveniently according to the defined rules, with relatively good accuracy and recall.

For identification, the basic sensitive lexicon is expanded with related words found by word2vec. Hidden Markov word polarity labeling is then introduced to filter out words that deviate from the intended sense ("away from the word") and to form the feature word set. Finally, feature word weights are calculated and combined with an SVM classifier to identify bad information, with good recognition results.
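
The lexicon expansion step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that expansion means adding every vocabulary word whose vector is close, by cosine similarity, to at least one seed word. The toy vectors, the `expand_lexicon` helper, and the 0.9 threshold are hypothetical stand-ins and are not taken from the thesis. The hidden Markov polarity filtering and the SVM classification are not shown.

```python
import math

# Toy 3-dimensional vectors standing in for trained word2vec embeddings.
# The values are hypothetical and chosen only for illustration.
TOY_VECTORS = {
    "gambling": [0.90, 0.10, 0.00],
    "casino":   [0.80, 0.20, 0.10],
    "betting":  [0.85, 0.15, 0.05],
    "weather":  [0.00, 0.10, 0.90],
    "forecast": [0.10, 0.00, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_lexicon(seeds, vectors, threshold=0.9):
    """Return the seed words plus every word similar to at least one seed.

    The threshold is an assumed parameter; a real system would tune it.
    """
    known_seeds = [s for s in seeds if s in vectors]
    expanded = set(seeds)
    for word, vec in vectors.items():
        if word in expanded:
            continue
        if any(cosine(vec, vectors[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return expanded


# Expand a one-word sensitive lexicon over the toy vocabulary.
expanded = expand_lexicon({"gambling"}, TOY_VECTORS)
print(sorted(expanded))
```

In this toy vocabulary the gambling-related words are added and the weather-related words are left out. A pure similarity expansion of this kind can also pull in words that are related but harmless, which is presumably why the thesis adds a polarity labeling step before forming the final feature word set.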

Keywords/Search Tags:

Information crawling, information extraction, bad information identification, word2vec

Related items

1	Research On Unstructured Instrumentation Inquiry Information Extraction Method
2	Design And Implementation Of Web Information Extraction Rules
3	The Information Leakage Detection Based On Text Information Extraction
4	Design And Implementation Of The Application Server For Maternal And Infant Information
5	Identification And Information Extraction Of Scholars’ Homepages
6	The Design And Implementation Of Web Information Extraction System
7	Found The Blog Knowledge-based Information Extraction Technology
8	On Demand Recommendation Of IOS App Based On Weighted Word2vec Fusion Of Multidimensional Information
9	Research On Competitive Information Extraction Based On Web
10	Research On Visual Positioning And Identification Technology Of Express Package Information Code