Research On Structured Data Extraction From Web Forums

Posted on:2011-03-31

Degree:Master

Type:Thesis

Country:China

Candidate:M Guan

Full Text:PDF

GTID:2178360305451060

Subject:Computer system architecture

Abstract/Summary:

With the rapid development and popularization of the Internet, web forums have become an important data resource on the Web. They provide people with a large amount of valuable knowledge and information. As a result, in recent years a growing number of research efforts have tried to use information extracted from forum data to build various web applications. To use forum data effectively, the fundamental step in most applications is to extract structured data from forum pages, and then to exploit that data further to achieve various functions.

Forum structured data extraction is the extraction of metadata from web forums, such as post title, post author, post time and post content. It is the foundation of processing forum data. Because of both complex page layout designs and unrestricted user-created posts, extracting structured data from web forum pages is a very challenging task that has not been solved well, and it has become a major obstacle to using forum data effectively. In this thesis, we focus on some key issues of structured data extraction from web forums. Our main contributions are as follows:

1. We propose an instance-based learning wrapper generation algorithm to extract structured data from web forums. The algorithm can start extraction from a single labeled instance and then performs extraction by comparing each new instance with the labeled instances. A new instance needs labeling only when it cannot be extracted, so the algorithm does not require an initial set of labeled pages to learn extraction rules. Experimental results on diverse web forum sites demonstrate the effectiveness of the method.

2. We propose a forum data record extraction algorithm based on automatic pattern discovery. The algorithm builds the HTML tag tree of a web page, mines the data region of the page by string comparison of nodes in the tag tree, and then extracts data records from the data region. Experimental results on the list pages and post pages of web forum sites show that the proposed approach significantly outperforms the classical method in extracting forum data records.

3. We propose a forum metadata extraction algorithm based on production rules. By analyzing the structure of forum sites and pages, the algorithm extracts metadata from data records using a set of production rules. It does not depend on a specific template, so it can adapt to periodic changes of the forum template and extract structured data automatically. Experimental results show that the proposed approach achieves high accuracy in extracting forum metadata such as post title, post author, post time and post content.
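
The data record extraction idea in the second contribution (build a tag tree, compare sibling nodes as strings, take the repeated siblings as records) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the abstract does not specify the similarity measure or thresholds, so the use of `difflib.SequenceMatcher`, the 0.8 threshold, the minimum of three records, and all class and function names below are assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of the HTML tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []

    def signature(self):
        # Serialize only the tag structure of the subtree, ignoring text.
        return self.tag + "(" + ",".join(c.signature() for c in self.children) + ")"

    def all_text(self):
        parts = list(self.text) + [c.all_text() for c in self.children]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree; tolerant of stray closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def similar(a, b, threshold):
    # String comparison of two subtrees via their tag signatures (assumed measure).
    matcher = SequenceMatcher(None, a.signature(), b.signature(), autojunk=False)
    return matcher.ratio() >= threshold


def find_records(root, threshold=0.8, min_records=3):
    """Return the longest run of adjacent, structurally similar siblings."""
    best = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        run = []
        for child in node.children:
            if run and similar(run[-1], child, threshold):
                run.append(child)
            else:
                run = [child]
            if len(run) > len(best):
                best = list(run)
    return best if len(best) >= min_records else []


def extract_records(html):
    builder = TreeBuilder()
    builder.feed(html)
    return [record.all_text() for record in find_records(builder.root)]


# Hypothetical forum list page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a></div>
<ul>
<li><a href="/t/1">First thread</a><span>alice</span></li>
<li><a href="/t/2">Second thread</a><span>bob</span></li>
<li><a href="/t/3">Third thread</a><span>carol</span></li>
</ul>
</body></html>
"""

if __name__ == "__main__":
    for record in extract_records(SAMPLE_PAGE):
        print(record)
```

In this sketch the data region is simply the parent whose children form the longest run of similar signatures. A real forum page would likely need records that span several sibling nodes and a more robust comparison, which the sketch does not attempt.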

Keywords/Search Tags:

Web information extraction, Forum, Structured data, Instance-based learning, Web mining

Related items

1	Forum Data Based QA Mining
2	Specific Instance Detection Based Multi-Instance Learning And Its Applications To Virtual Props Recommendation
3	Research On Keyword Extraction And Structured List Data Extraction
4	The Research And Implementation Of QA Techniques Based On Forum Data
5	A Dynamic Learning Framework To Automatically Extract Structured Data From Web Pages Without Human Efforts
6	Research On Key Technologies Of Programming Forum Search
7	Integrating Forum Data Crawler With Rule-based Information Extraction
8	Research On Web Forums Information Extraction System Based On Distributed Architecture
9	Research On Instance Selection Of Structured Data Based On Reinforcement Learning
10	Research On Event Extraction Based On Structured Learning