Research On Bottom-up Web Data Extraction

Posted on:2012-11-01

Degree:Master

Type:Thesis

Country:China

Candidate:T Liu

Full Text:PDF

GTID:2248330395958225

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of technology, the amount of information in every field is growing fast, and the Internet, as an important medium, is growing fastest of all. The web contains data from many sources in many fields, in varied and complex forms. As a result, users can hardly find the information they actually need quickly and precisely.

To manage information on the web effectively, high-quality structured data must be obtained from these data sources, so web data needs to be extracted and integrated efficiently and precisely. We propose a bottom-up web data extraction approach. In contrast to other approaches, this method starts by labeling attributes and then builds and integrates the structured data. In this thesis, every text sequence on a web page is called an entity. The approach consists of two parts: entity extraction and entity integration. It depends less on page structure, which gives it greater extensibility and flexibility.

The thesis focuses on the entity extraction strategy and the entity integration algorithms, including the Two-Level extraction model, the repetitive pattern extraction algorithm, and the pattern refinement algorithm. The Two-Level extraction model divides rules into recall rules and precision rules, designed to ensure high recall and high precision respectively. The FindPattern algorithm extracts repetitive patterns from the attribute array according to the features of text on the web. To reduce the time spent on pattern matching, the RefinePattern algorithm refines the repetitive patterns based on finite automata. The thesis also further studies a hierarchical scheme for separating web pages.

Experimental results show that the bottom-up method extracts structured data from the web effectively and outperforms traditional techniques in both recall and precision. The approach is also more extensible and scalable, and can be widely used to integrate web data sources on different topics.
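
The bottom-up idea in the abstract (label each text sequence first, then look for repeated label patterns) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the rule sets, the label names, and the functions `label_entity` and `find_repeated_patterns` are hypothetical, and the n-gram counting below is only a simple stand-in for whatever FindPattern actually does.

```python
import re
from collections import Counter

# Hypothetical recall rules: loose patterns that propose a candidate label.
RECALL_RULES = {
    "price": re.compile(r"\d"),
    "date": re.compile(r"\d"),
}

# Hypothetical precision rules: stricter patterns that confirm a candidate.
PRECISION_RULES = {
    "price": re.compile(r"^[$€£]\s?\d+(\.\d{2})?$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def label_entity(text):
    """Return the first label accepted by both its recall and precision rule.

    Falls back to the generic label "text" when no rule pair accepts.
    """
    for label, recall in RECALL_RULES.items():
        if recall.search(text) and PRECISION_RULES[label].match(text):
            return label
    return "text"


def find_repeated_patterns(labels, min_len=2, min_count=2):
    """Count contiguous label subsequences seen at least min_count times.

    Occurrences are counted at every start position, so they may overlap.
    """
    counts = Counter()
    max_len = len(labels) // min_count
    for n in range(min_len, max_len + 1):
        for i in range(len(labels) - n + 1):
            counts[tuple(labels[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


# Text sequences as they might appear, in order, on a product listing page.
texts = ["Widget", "$9.99", "2012-11-01", "Gadget", "$19.50", "2012-10-15"]
labels = [label_entity(t) for t in texts]
patterns = find_repeated_patterns(labels)
```

Here `labels` plays the role of the attribute array, and the longest repeated pattern, `("text", "price", "date")`, hints at the record layout without consulting the page's markup. A real system would need far richer rules and a proper refinement step.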

Keywords/Search Tags:

entity extraction, entity integration, bottom-up, web data extraction, page separating

Related items

1	Domain-Oriented Web Entity Expansion And Robust Optimization Of The Wrapper
2	English Entity Answer Extraction And Home Find
3	Research On Web Entity Activity And Entity Relationship Extraction
4	Research Of Entity Knowledge Base System Based On Information Extraction
5	Research On Entity Relationship Extraction
6	Research On Joint Extraction Of Entity Relations By Fusing Entity Local Information
7	Research On The Techniques Of Entity Identity On XML Data
8	Entity Analysis with Weak Supervision: Typing, Linking, and Attribute Extraction
9	Research On Sentence-level Entity Relationship Extraction With Thai Features
10	Research Of Related Entity Extraction And Homepages Finding