Design And Implementation Of Web Information Extraction Subsystem In The Public Opinion System

Posted on: 2014-10-04  Degree: Master  Type: Thesis
Country: China  Candidate: Y Li  Full Text: PDF
GTID: 2268330401466209  Subject: Computer technology
Abstract/Summary:
Network public opinion reflects the attitude of the general public toward a variety of events, and it is an important channel through which the relevant departments can understand public sentiment. Web information extraction provides the input to a public opinion analysis system; it determines the source of public opinion information and the judgments drawn from it. With the rapid development of the Internet, Web pages take more and more forms. To obtain public opinion information quickly and accurately, there is an increasingly high demand on Web information extraction.

This thesis studies Web information extraction technology in response to these issues. Based on an in-depth analysis of current Web page forms and page structure, combined with the analysis requirements of public opinion, it proposes different extraction methods for four sources of public opinion pages: news, blogs, forums, and microblogs.

The main research work is as follows:

1. This thesis studies Web information extraction for news and blog pages. It uses a common page text extraction technique to extract the main body and regular expressions to extract the other data items. The method does not depend on page structure, and it offers high extraction speed, high accuracy, and versatility.

2. This thesis studies Web page clustering algorithms and proposes a clustering method based on the HTML tag tree. The method relies on forum page structure: it calculates a value for each node of the HTML tag tree and uses a weighted cosine similarity formula to calculate the similarity of two Web tag trees. The method gives good clustering results, and its time complexity is O(n).

3. This thesis studies automatic Web information extraction based on similarity comparison of page structure and proposes an automatic information extraction method for forum pages. The method automatically generates an extraction template for each forum website. The template uses entropy, structural similarity, and other features to identify specific information in forums, and it is then used to extract information from the other pages of the same website.

4. This thesis studies Web information extraction for microblog pages and proposes a new extraction method that combines tag attributes with regular expressions. The method takes many characteristics of the data items into account: it uses tag attributes and attribute values to locate the data items, and regular expressions to complete the precise extraction.

The experiments show that the four extraction methods proposed in this thesis for news, blog, forum, and microblog pages can extract information from the massive network rapidly and accurately. They achieve high recall and precision rates, and they transform the extracted data into structured data that is stored in a database. All of these methods meet the requirements of a public opinion data analysis system.
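
As an illustration of the tag-tree similarity idea in item 2, the following is a minimal sketch and not the thesis's actual algorithm. The abstract does not give the node-value formula, so the weighting used here (each root-to-node tag path contributes 1/depth) is an assumption, as are the names `TagPathCounter`, `tag_vector`, and `cosine_similarity`. Each page is reduced to a weighted vector of tag paths in a single pass over its tags, and two pages are compared with ordinary cosine similarity over those weighted vectors.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCounter(HTMLParser):
    """Accumulates a weight for every root-to-node tag path in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags, root first
        self.weights = Counter()  # tag path -> accumulated weight

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        depth = len(self.stack) + 1
        # Assumed node value: shallow nodes weigh more than deep ones.
        self.weights[path] += 1.0 / depth
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_vector(html):
    """Return the weighted tag-path vector of an HTML string."""
    parser = TagPathCounter()
    parser.feed(html)
    parser.close()
    return parser.weights


def cosine_similarity(a, b):
    """Cosine similarity of two sparse weight vectors (dict-like)."""
    dot = sum(w * b.get(path, 0.0) for path, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two pages with the same layout but different text, and one with another layout.
page_a = "<html><body><div><p>hello</p><p>world</p></div></body></html>"
page_b = "<html><body><div><p>foo</p><p>bar</p></div></body></html>"
page_c = "<html><body><table><tr><td>x</td></tr></table></body></html>"

print(cosine_similarity(tag_vector(page_a), tag_vector(page_b)))
print(cosine_similarity(tag_vector(page_a), tag_vector(page_c)))
```

Pages generated from the same forum template should score close to 1 under this measure, while pages with a different layout score lower. A clustering step could then group pages whose similarity exceeds a chosen threshold; that threshold is not specified in the abstract.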
Keywords/Search Tags: information extraction, automatic template generation, Web page clustering