Research On Bottom-up Web Data Extraction

Posted on: 2012-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: T Liu
Full Text: PDF
GTID: 2248330395958225
Subject: Computer software and theory
Abstract/Summary:
With advances in technology, the amount of information in every field is growing rapidly, and the Internet, as a major medium, is growing fastest. The web contains data from many sources and fields, in varied and complex forms. As a result, users can hardly find the information they actually need quickly and precisely.

To manage information on the web effectively, we have to obtain high-quality structured data from these data sources. Hence, it is necessary to extract and integrate web data efficiently and precisely. We propose a bottom-up web data extraction approach. In contrast with other methods, it starts with attribute labeling and then builds and integrates the structured data. In this thesis, we call every text sequence on the web an entity. Our approach consists of two parts, namely entity extraction and entity integration. It does not depend on page structure, which gives it greater extensibility and flexibility.

The thesis focuses on the strategy for entity extraction and the algorithm for entity integration, including the Two-Level extraction model, the repetitive pattern extraction algorithm, and the pattern refinement algorithm. The Two-Level extraction model divides rules into recall rules and precision rules, designed to guarantee high recall and high precision respectively. The FindPattern algorithm extracts repetitive patterns from the attribute array according to the text features of the web page. To reduce the time spent on pattern matching, the RefinePattern algorithm refines the repetitive patterns based on finite automata. In addition, the thesis further studies a hierarchical scheme for segmenting web pages.

Experimental results show that the bottom-up method extracts structured data from the web effectively and outperforms traditional techniques in both recall and precision. The approach is also more extensible and scalable, and can be widely used to integrate web data sources on different topics.
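
The abstract gives no implementation details, so the following is only a minimal sketch of the bottom-up idea under stated assumptions: text sequences are first labeled with attributes using a broad "recall" rule that must be confirmed by a stricter "precision" rule, and a repeating unit is then searched for in the resulting label array. The rule tables, the function names `label_entity` and `find_pattern`, and the naive repeat search are all hypothetical illustrations; they are not the thesis's Two-Level model or its FindPattern algorithm, and pattern refinement is not covered.

```python
import re

# Hypothetical recall rules: deliberately broad, so few candidates are missed.
RECALL_RULES = {
    "price": re.compile(r"\d"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

# Hypothetical precision rules: strict, so a proposed label is only kept
# when the whole text sequence has the expected shape.
PRECISION_RULES = {
    "price": re.compile(r"^\$\d+(\.\d{2})?$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def label_entity(text):
    """Label one text sequence; fall back to the generic label 'text'."""
    for label, recall in RECALL_RULES.items():
        if recall.search(text) and PRECISION_RULES[label].match(text):
            return label
    return "text"


def find_pattern(labels, min_repeats=2):
    """Naive search for a contiguously repeating unit in a label array.

    Returns (unit, repeats) for the unit covering the most labels,
    preferring shorter units and earlier starts on ties, or None.
    """
    best_key, best = None, None
    n = len(labels)
    for period in range(1, n // min_repeats + 1):
        for start in range(0, n - period * min_repeats + 1):
            unit = labels[start:start + period]
            repeats, pos = 1, start + period
            # Count how many times the unit repeats back to back.
            while labels[pos:pos + period] == unit:
                repeats += 1
                pos += period
            if repeats >= min_repeats:
                key = (repeats * period, -period, -start)
                if best_key is None or key > best_key:
                    best_key, best = key, (tuple(unit), repeats)
    return best


# Made-up page content, in document order.
sequences = [
    "Site header",
    "Widget A", "$9.99", "2012-11-01",
    "Widget B", "$19.50", "2012-10-15",
    "Widget C", "$5.00", "2012-09-30",
]
labels = [label_entity(s) for s in sequences]
print(find_pattern(labels))
```

In this sketch the repeating unit found in the label array corresponds to one record (name, price, date), which is the sense in which structured data is built from labeled attributes upward rather than from the page structure downward.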
Keywords/Search Tags: entity extraction, entity integration, bottom-up, web data extraction, page separating