
Design And Implementation Of Web Information Extraction Rules

Posted on: 2014-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Liu
Full Text: PDF
GTID: 2248330395995489
Subject: Computer software and theory
Abstract/Summary:
The Web has become the world's largest information source and contains a wealth of valuable information. Web information extraction technologies study how to accurately obtain information of interest to users or applications from web pages. Most existing Web information extraction research focuses on automatically analyzing and extracting data records from given web pages. It overlooks the browse-navigation process needed to reach the target web pages, as well as the record integration process that follows extraction. To address this shortcoming, this paper first comprehensively studies the Web information extraction model. The model takes account of three processes: browse-navigation, data record extraction and data integration.

In the browse-navigation stage, this paper proposes a web browse-navigation model that describes the user's browse-navigation interactions on web pages. By replaying these browse-navigation actions in the web data extraction runtime, the system can follow a series of navigation links to reach the target pages. In the data record extraction stage, this paper studies a data extraction model that comprehensively handles various complex web data records, including both irregular and regular data records. In the data integration stage, we study an XML-based hierarchical target data model that allows users to define various target data structures and integrates the original data extracted from web pages into target data records of the specified structure through data conversion and mapping methods.

On the basis of the above Web information extraction model, this paper designs an extraction rule language that comprehensively describes browse-navigation logic, data record extraction logic and data integration logic. In particular, in the Web data extraction model we generalize three types of regular data records: row-based records, column-based records and grid-based records.
We devise an extraction rule system that can handle these three types of regular data records.

Data records extracted by structural extraction rules often contain coarse-grained unstructured or semi-structured text data, which needs further filtering and extraction. Therefore, we introduce text extraction rules to extract fine-grained data elements from coarse-grained text data. This paper proposes an approach that automatically generates text extraction rules based on small sample learning. The approach uses a multiple sequence alignment method. To reduce the time complexity, this paper proposes a center core multiple sequence alignment algorithm. On the basis of multiple sequence alignment, we further introduce the concept of information entropy to measure the consistency of the alignment result and thereby identify data columns and template columns. Finally, a text extraction rule is obtained after some subsequent processing. This approach does not require any manual labeling, reduces the burden on users and improves processing efficiency.

In addition, to improve the degree of automation in generating extraction rules, this paper uses the results of Web data record mining to assist rule generation. To this end, we present an algorithm that learns features of the data records extracted by Web data record mining and uses these features as attribute values of extraction rules.
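The entropy step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the sample texts have already been tokenized and aligned into equal-length rows, and the function names, the sample data and the zero-entropy threshold are all hypothetical choices made for illustration. A column whose tokens are identical across samples has entropy 0 and is treated as a template column; a column whose tokens vary has positive entropy and is treated as a data column.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the tokens in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def classify_columns(aligned_rows, threshold=0.0):
    """Label each column 'template' if its entropy is <= threshold, else 'data'.

    aligned_rows: equal-length token lists, assumed to come from a prior
    multiple sequence alignment step (not shown here).
    """
    columns = zip(*aligned_rows)  # transpose rows into columns
    return [
        "template" if column_entropy(col) <= threshold else "data"
        for col in columns
    ]


# Hypothetical aligned samples: only the fourth token varies.
aligned = [
    ["Price", ":", "$", "12", "USD"],
    ["Price", ":", "$", "7", "USD"],
    ["Price", ":", "$", "105", "USD"],
]
labels = classify_columns(aligned)
print(labels)
```

In practice a threshold slightly above zero may be more tolerant of noisy samples; the value used here is only an assumption.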
Keywords/Search Tags: Web information extraction, Web information extraction models, extraction rule language, text extraction rule, small sample learning, multiple sequence alignment, information entropy