
Extracting Informative Semantic Contents From Web Pages

Posted on: 2014-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z He
Full Text: PDF
GTID: 2268330425475254
Subject: Computer software and theory
Abstract/Summary:
The explosive growth in the number of web pages makes modeling and extracting semantic information from a web page an increasingly challenging task. Although semantic information plays a significant role in ontology construction, web mining, and other applications, most current semantic interpretation methods require intensive human decisions, while others are restricted to particular domains. They therefore cannot meet today's large-scale and frequent application needs.

This thesis presents a knowledge model that depicts the logical view of a web page. Using a small number of manually labeled training samples together with web mining and extraction techniques, a web page is automatically turned from a stream of HTML tags and characters into a sequence of semantic blocks. The locations and functionalities of these blocks are the main semantic information of interest in this work.

Based on repeated structures, a long-studied type of data with many unique features, we propose a 3-step process to extract structural semantic information from a web page. In the first step, we design a compound classifier that combines decision tree and SVM algorithms to identify repeated structures in the web page. In the second step, meaningful repeated structures are defined as logical blocks that segment the page. In the last step, a semantic label is assigned to each segment of the page to represent its functionality, and informative contents are then extracted accordingly.

Compared with existing methods, the proposed model and extraction method are easy to implement. Our method is insensitive to changes in field, topic, and web page layout. It requires little manual effort and is expected to achieve precise block extraction for every web page. In this thesis, we go through the proposed extraction process and explain each step in detail. In the experiment section, our method is compared with two state-of-the-art systems to demonstrate the significant value of our research.
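To make the notion of a "repeated structure" concrete, the following is a minimal sketch of how candidate repeated structures might be located in a page. It is not the thesis's method: the abstract describes a trained compound classifier (decision tree plus SVM), whereas this sketch swaps in a simple counting heuristic that flags any element whose children share the same tag and class at least `min_repeats` times. The names `RepeatFinder`, `find_repeated_structures`, `min_repeats`, and `VOID_TAGS` are hypothetical, and the code assumes well-formed markup in which every non-void tag is explicitly closed.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: a small set of void tags that never receive a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class RepeatFinder(HTMLParser):
    """Records, for every element, how often each child signature occurs."""

    def __init__(self):
        super().__init__()
        # Each stack entry is (path of the open element, Counter of its children).
        self.stack = [("root", Counter())]
        self.blocks = []  # closed elements, in closing order

    def handle_starttag(self, tag, attrs):
        # Signature = tag name, plus the class attribute when present.
        cls = dict(attrs).get("class")
        signature = tag + "." + cls if cls else tag
        path, counts = self.stack[-1]
        counts[signature] += 1
        if tag not in VOID_TAGS:
            self.stack.append((path + "/" + signature, Counter()))

    def handle_endtag(self, tag):
        # Assumes well-formed nesting; the tag name is not checked.
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.blocks.append(self.stack.pop())


def find_repeated_structures(html, min_repeats=3):
    """Return (parent_path, child_signature, count) for repeated children."""
    finder = RepeatFinder()
    finder.feed(html)
    finder.close()
    results = []
    for path, counts in finder.blocks + finder.stack:
        for signature, n in counts.items():
            if n >= min_repeats:
                results.append((path, signature, n))
    return results
```

In a pipeline like the one described above, such candidates would presumably be passed to a classifier that decides which repeated structures are meaningful. Segmenting the page into logical blocks and assigning semantic labels are outside the scope of this sketch.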
Keywords/Search Tags:Web Semantic Data, Semantic Blocks Annotation, Web Mining, Web Data Extraction, Machine Learning, SVM