Extracting Informative Semantic Contents From Web Pages

Posted on:2014-05-26

Degree:Master

Type:Thesis

Country:China

Candidate:Z He

GTID:2268330425475254

Subject:Computer software and theory

Abstract/Summary:

The explosive growth in the number of web pages makes modeling and extracting semantic information from a web page increasingly challenging. Although semantic information plays a significant role in ontology construction, web mining, and other applications, most current semantic interpretation methods require intensive human decisions, and others are restricted to particular domains. They therefore cannot meet today's large-scale and frequent application needs.

This thesis presents a knowledge model that depicts the logical view of a web page. Using a small number of manually labeled training samples together with web mining and extraction techniques, a web page is automatically turned from a stream of HTML tags and characters into a sequence of semantic blocks. The locations and functionalities of these blocks are the main semantic information of interest in this work.

Based on repeated structures, a long-studied type of data with many unique features, we propose a 3-step process to extract structural semantic information from a web page. In the first step, we design a compound classifier that combines decision tree and SVM algorithms to identify repeated structures in the web page. In the second step, meaningful repeated structures are defined as logical blocks that segment the page. In the last step, a semantic label is assigned to each segment of the page to represent its functionality, and informative contents are then extracted accordingly.

Compared with existing methods, the proposed model and extraction method are easy to implement. Our method is insensitive to changes in field, topic, and web page layout. It requires little manual effort and is expected to achieve precise block extraction for every web page. In this thesis, we go through the proposed extraction process and explain each step in detail. In the experiment section, our method is compared with two state-of-the-art systems to demonstrate the value of our research.
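
To make the notion of a repeated structure concrete, the following is a minimal Python sketch of how candidate repeated structures might be located in a page before any classification takes place. It is not the method of the thesis: the abstract does not describe the classifier's features or the candidate-generation procedure, so the structural signature used here (an element's tag plus the tags of its direct children), the min_repeat threshold, and the names Node, TreeBuilder, and find_repeated_structures are all assumptions made for illustration. A decision tree or SVM classifier, as named in the abstract, would then decide which candidates are meaningful.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A bare-bones element node (hypothetical helper, tags only)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def signature(self):
        # Assumed fingerprint: own tag plus the tags of direct children.
        return (self.tag, tuple(child.tag for child in self.children))


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def find_repeated_structures(html, min_repeat=3):
    """Return (parent_tag, child_tag, count) for sibling groups sharing a signature."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()

    results = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        counts = Counter(child.signature() for child in node.children)
        for (child_tag, _), count in counts.items():
            if count >= min_repeat:
                results.append((node.tag, child_tag, count))
        stack.extend(node.children)
    return results


if __name__ == "__main__":
    page = ("<ul><li><a>News</a></li><li><a>Sports</a></li>"
            "<li><a>Weather</a></li></ul><p>Footer text</p>")
    print(find_repeated_structures(page))
```

In this sketch, a navigation list with three identically shaped items is reported as one candidate, while the lone paragraph is not. Exact signature matching is deliberately strict; real pages usually need a tolerant similarity measure, which is one reason a learned classifier is attractive for this step.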

Keywords/Search Tags:

Web Semantic Data, Semantic Blocks Annotation, Web Mining, Web Data Extraction, Machine Learning, SVM

Related items

1	A Study About The Process Of Automatic Image Annotation
2	Super Data The Integrated Mining Method And Technology Research
3	The Research On Semantic-driven Image Mining Using Statistical Learning
4	Automatic Semantic Annotation Method For IoT Sensory Data Research And Implementation
5	Semantic Annotation Based Traditional Chinese Medicine Data Process Platform
6	Semantic Annotation For Documents In Professional Domain Based On NLP
7	Research On Technology Of Deep Web Oriented Data Extraction And Semantic Annotation
8	A Study And Implementation Of Semantic Annotation For Chinese Text
9	Study On Data Extraction And Semantic Annotation For Specific Field Deep Web
10	Research On Key Technologies Of Semantic Retrieval Based On Multimodal Data