Design And Implementation Of Web Information Extraction Subsystem In The Public Opinion System

Posted on:2014-10-04

Degree:Master

Type:Thesis

Country:China

Candidate:Y Li

GTID:2268330401466209

Subject:Computer technology

Abstract/Summary:

Network public opinion reflects the attitude of the general public toward a variety of events, and it is an important channel through which the relevant departments understand public opinion. Web information extraction supplies the input to a public opinion analysis system, and it determines the sources and the judgment of public opinion information. With the rapid development of the Internet, Web pages take more and more forms. To obtain public opinion information quickly and accurately, the demands placed on Web information extraction are increasingly high.

This thesis studies Web information extraction technology in response to these issues. Based on an in-depth analysis of current Web page forms and page structure, combined with the analysis requirements of public opinion, it proposes different extraction methods for four sources of public opinion pages: news, blogs, forums and microblogs.

The research covers the following:

1. This thesis studies Web information extraction for news and blog pages. It uses a common page text extraction technique to extract the main body and regular expressions to extract the other data items. The method does not depend on page structure, and it offers high extraction speed, high accuracy and versatility.

2. This thesis studies Web page clustering algorithms and proposes a clustering method based on the HTML tag tree. The method relies on forum page structure: it calculates a value for each node of the HTML tag tree and uses a weighted cosine similarity formula to calculate the similarity of two Web page tag trees. The method gives good clustering results, and its time complexity is O(n).

3. This thesis studies automatic Web information extraction based on similarity comparison of Web page structure and proposes an automatic information extraction method for forum pages. The method automatically generates an extraction template for each forum website. The template uses entropy, structural similarity and other features to identify specific information in forums, and it is then used to extract information from the other pages of the same website.

4. This thesis studies Web information extraction for microblog pages and proposes a new extraction method that combines tag attributes with regular expressions. The method takes many characteristics of the data items into account: it uses tag attributes and attribute values to locate the data, and regular expressions to complete the precise extraction.

The experiments show that the four extraction methods proposed in this thesis, for news, blog, forum and microblog pages, extract information from massive network data rapidly and accurately. They achieve high recall and precision, and they transform the extracted data into structured data that is stored in a database. All of these methods meet the requirements of a public opinion data analysis system.
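
The tag-tree similarity described in item 2 can be illustrated with a minimal Python sketch. The abstract does not give the node-value formula, so the weighting used here (each root-to-node tag path contributes 1/depth) is an assumption made purely for illustration, and the names `TagPathCollector`, `tag_tree_vector` and `weighted_cosine` are hypothetical. Only the general idea, a weighted cosine similarity over two HTML tag trees, comes from the abstract.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Accumulates a weight for every root-to-node tag path in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags, root first
        self.weights = Counter()   # tag path -> accumulated weight

    def handle_starttag(self, tag, attrs):
        depth = len(self.stack) + 1
        path = "/".join(self.stack + [tag])
        # Assumed node value: shallow nodes describe layout, so they weigh more.
        self.weights[path] += 1.0 / depth
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_tree_vector(html):
    """Parse a page once and return its weighted tag-path vector."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.weights


def weighted_cosine(a, b):
    """Cosine similarity between two weighted tag-path vectors."""
    dot = sum(w * b[path] for path, w in a.items() if path in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two forum thread pages share a structure; the board index page does not.
thread_a = "<html><body><div><p>first post</p><p>reply</p></div></body></html>"
thread_b = "<html><body><div><p>other post</p><p>answer</p></div></body></html>"
index = "<html><body><table><tr><td>board</td></tr></table></body></html>"

same = weighted_cosine(tag_tree_vector(thread_a), tag_tree_vector(thread_b))
diff = weighted_cosine(tag_tree_vector(thread_a), tag_tree_vector(index))
print(same > diff)
```

In this sketch each page is parsed in a single pass, and the comparison visits each distinct tag path once. Pages whose similarity exceeds a chosen threshold could then be placed in the same cluster; the threshold and the clustering procedure itself are not specified in the abstract.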

Keywords/Search Tags:

information extraction, automatic template generation, Web page clustering

Related items

1	Research On Web Article Automatic Extraction Method Based On Page Segmentation
2	Research On Web Information Extraction Based On Page Structure Clustering
3	Technology For Domain-oriented Automatic Information Extraction From Semi-structured Web
4	The Design And Implement Of Web Page Automatic Categorization And Storage Management System
5	Research On Mining Structure Of WEB Page For Information Extraction
6	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique
7	Research On Web Page Classification And Information Collection
8	Research On Automatic Web Information Extraction Technique
9	Research On Web Information Extraction Based On Clustering Algorithm
10	Algorithm Research Of Information Extraction And Its Application In Scientific Research And Service System