Research On Key Techniques Of Web Information Extraction For Online Public Opinion Analysis

Posted on:2011-06-25

Degree:Master

Type:Thesis

Country:China

Candidate:H Zhou

Full Text:PDF

GTID:2178360305970878

Subject:Computer application technology

Abstract/Summary:

The Internet plays an increasingly significant role in people's lives, and the World Wide Web has become its most important service. Effectively extracting suitable information from the vast Web is therefore important for online public opinion analysis. Although many methods have been proposed for this problem, existing Web information extraction methods have limitations. This thesis studies data acquisition techniques, covering both Web crawlers and automated Web information extraction.

First, the thesis introduces the background and development history of Web crawlers and Web information extraction, and analyzes the characteristics and disadvantages of the existing methods.

Second, the thesis analyzes the special requirements that online public opinion analysis places on its source data, and proposes two specialized crawlers, one for forum websites and one for blog websites. For forum websites, a crawling strategy based on three levels is proposed: the crawler can extract information effectively with only the template of one topic, and a reverse crawling strategy solves the problem of topic updating. The blog crawler uses the user as its crawling unit: it maps user IDs into memory to check for existing users, extracts information from pages on the first visit and from RSS for later updates, and obtains information in real time by focusing on active users.

Then, the thesis proposes an automated Web information extraction and classification method for list pages, based on the DOM Tree structure and the text features of Web pages. The page is first preprocessed and its DOM Tree is constructed. The informative data record set is then extracted by mining similar subtrees from the DOM Tree, and detailed information is extracted from each data record in that set based on the structure of its subtree. Finally, the detailed information is classified by extracting template text, matching the CSS class values against a classification feature library, and analyzing tag names and text features.

Finally, experiments show that the method can extract and classify information effectively.
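
The list-page pipeline described in the abstract (build a DOM tree, mine similar subtrees as data records, then classify fields by CSS class) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not give the similarity measure, so this sketch assumes the simplest possible one: sibling subtrees count as similar only when their tag structure is identical. The names `TreeBuilder`, `find_record_sets`, `FEATURE_LIBRARY`, and the sample HTML are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

# Hypothetical feature library: CSS class substring -> field label.
FEATURE_LIBRARY = {"title": "title", "author": "author", "date": "publish_time"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML using the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Structural signature: the tag plus the signatures of its children."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def find_record_sets(node, min_repeat=3):
    """Return groups of sibling subtrees that share one signature (assumed similarity)."""
    results = []
    groups = defaultdict(list)
    for child in node.children:
        if child.children:  # leaves carry no structure to compare
            groups[signature(child)].append(child)
    results.extend(g for g in groups.values() if len(g) >= min_repeat)
    for child in node.children:
        results.extend(find_record_sets(child, min_repeat))
    return results


def record_fields(node):
    """Collect (tag, css_class, text) for every text-bearing node in a record."""
    fields = []
    if node.text:
        fields.append((node.tag, node.attrs.get("class") or "", node.text))
    for child in node.children:
        fields.extend(record_fields(child))
    return fields


def classify(field):
    """Label a field by matching its CSS class against the feature library."""
    _, css_class, _ = field
    for keyword, label in FEATURE_LIBRARY.items():
        if keyword in css_class:
            return label
    return "unknown"


def extract_records(html, min_repeat=3):
    builder = TreeBuilder()
    builder.feed(html)
    rows = []
    for group in find_record_sets(builder.root, min_repeat):
        for record in group:
            rows.append([(classify(f), f[2]) for f in record_fields(record)])
    return rows


SAMPLE_HTML = """
<ul>
  <li><a class="post-title">First</a><span class="post-date">2011-01-01</span></li>
  <li><a class="post-title">Second</a><span class="post-date">2011-01-02</span></li>
  <li><a class="post-title">Third</a><span class="post-date">2011-01-03</span></li>
</ul>
"""

for row in extract_records(SAMPLE_HTML):
    print(row)
```

Exact signature matching only works when every record has the same tag structure. Real list pages often contain records with optional fields, so a practical system would need a tolerant similarity measure such as a tree edit distance. The abstract also mentions template text and text features as classification cues; the sketch omits both.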

Keywords/Search Tags:

Information extraction, DOM Tree, Data record, Network crawler

Related items

1	Research Of Web Information Extraction Technology Based On Tree Structure
2	Research On Techniques Of Automatic Data Record Analysis And Recognition For Accurate Web Information Extraction
3	Design And Implementation Of The Crawler Log Data Information Extraction And Statistical System
4	Research On Deep Web Data Acquisition Based On Visual Information And DOM Tree
5	The Application Of Data Mining Technology In The Medical Record Information Management
6	Integrating Forum Data Crawler With Rule-based Information Extraction
7	Based On The Specific Web Crawler API Weather Data Fetching In The Research And Implementation
8	Research And Implementation Of Microblog Crawler
9	Based On Templated Web Crawler Technology Of Web Page Information Extraction
10	Study Of Web Crawler And Web Information Extraction