
Research On Structured Data Extraction From Web Forums

Posted on: 2011-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: M Guan
Full Text: PDF
GTID: 2178360305451060
Subject: Computer system architecture
Abstract/Summary:
Nowadays, with the rapid development and popularization of the Internet, web forums have become an important data resource on the Web. They provide people with a great deal of highly valuable knowledge and information. As a result, in recent years more and more research efforts have tried to use information extracted from forum data to build various web applications. To use forum data effectively, the fundamental step in most applications is to extract structured data from forum pages; the extracted data can then be exploited further to achieve various functions.

Forum structured data extraction is the extraction of meta-data such as post title, post author, post time and post content from web forums. It is the foundation of processing forum data. Because of both complex page layout designs and unrestricted user-created posts, extracting structured data from web forum pages is a very challenging task that has not yet been solved well, and it has become a major obstacle to using forum data effectively. In this thesis, we focus on some key issues of structured data extraction from web forums. Our main contributions are as follows:

1. We propose an instance-based learning wrapper generation algorithm to extract structured data from web forums. The algorithm can start extraction from a single labeled instance and then performs extraction by comparing each new instance with the labeled instances. A new instance needs labeling only when it cannot be extracted, so the algorithm does not require an initial set of labeled pages to learn extraction rules. Experimental results on diverse web forum sites demonstrate the effectiveness of the method.

2. We propose a forum data record extraction algorithm based on automatic pattern discovery. The algorithm builds the HTML tag tree of a web page, mines the data region of the page by string comparison of nodes in the tag tree, and then extracts data records from the data region. Experimental results on the list pages and post pages of web forum sites show that the proposed approach significantly outperforms the classical method in extracting forum data records.

3. We propose a forum meta-data extraction algorithm based on production rules. By analyzing the structure of forum sites and pages, the algorithm extracts meta-data from data records using a set of production rules. It does not depend on a specific template, so it can adapt to periodic changes of the forum template and extract structured data automatically. Experimental results show that the proposed approach achieves high accuracy in extracting meta-data of web forums such as post title, post author, post time and post content.
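To make the second contribution more concrete, the following is a minimal Python sketch of the general idea: build a tag tree, compare sibling nodes as tag strings, and treat the longest run of similar siblings as the data region. It is an illustration under assumptions, not the algorithm from the thesis: the names `Node`, `TagTreeBuilder`, `find_data_region`, the use of `difflib.SequenceMatcher` as the string comparison, the 0.8 similarity threshold, and the sample page are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, enough for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the tag tree; text is kept only for the final output."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []

    def signature(self):
        # Serialize the subtree as a tag string, ignoring text and attributes.
        inner = "".join(child.signature() for child in self.children)
        return f"<{self.tag}>{inner}</{self.tag}>"


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def parse_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def similar(a, b, threshold):
    # Assumed similarity measure: ratio of matching characters in the tag strings.
    return SequenceMatcher(None, a.signature(), b.signature()).ratio() >= threshold


def find_data_region(root, threshold=0.8):
    """Return (parent, records) for the longest run of similar adjacent siblings."""
    best_node, best_run = None, []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        run = []
        for child in node.children:
            run = run + [child] if run and similar(run[-1], child, threshold) else [child]
            if len(run) >= 2 and len(run) > len(best_run):
                best_node, best_run = node, run
    return best_node, best_run


def all_text(node):
    parts = list(node.text)
    for child in node.children:
        parts.extend(all_text(child))
    return parts


# Hypothetical post page: three posts that share a template but differ slightly.
SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a></div>
<ul>
<li><b>alice</b><span>First post</span></li>
<li><b>bob</b><span>Reply <i>one</i></span></li>
<li><b>carol</b><span>Reply two</span></li>
</ul>
</body></html>
"""

region, records = find_data_region(parse_tag_tree(SAMPLE))
record_texts = [all_text(record) for record in records]
```

Comparing tag strings rather than text lets posts with different content, or with small markup differences such as the extra `<i>` element in the second post, still be grouped into one data region. A real forum page would need a more careful comparison and handling of nested regions.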
Keywords/Search Tags: Web information extraction, Forum, Structured data, Instance-based learning, Web mining