Optimizing Of Extraction Rules And Expressing Of The Rules With XQuery In Web Information Extraction Systems

Posted on:2004-05-16

Degree:Master

Type:Thesis

Country:China

Candidate:S F Chen

Full Text:PDF

GTID:2168360122461102

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the World Wide Web has become the most important and promising information resource for global broadcasting and for sharing scientific, educational, commercial, and entertainment information. Web documents marked up in HTML are aimed at presentation and lack schema and semantic information, so XML and its related specifications have emerged to address problems such as the efficient management and retrieval of Web information. XML separates semantic information from presentation and has become the standard for information exchange. However, most valuable Web information is, and will remain, in HTML form. Therefore, to access Web information in a structured and uniform way, information extraction technology is applied to the Web.

In this paper, we present an improved Web information extraction approach based on an analysis of the factors that affect precision and recall. After studying typical systems and carefully considering the basis of extraction and the mismatches between the semantic schema structure and the Web page structure, the approach introduces three levels of correlated extraction rules: the raw rule, the optimized rule, and the XQuery-based rule. First, the system generates the raw rule, composed of rule segments, with the help of the user. Second, from the raw rule the system induces an optimized rule, expressed in standard XPath, for every semantic object in the semantic schema; during this step, negative instances and the mismatches between the Web page structure and the user-defined semantic schema structure are taken into account to improve performance. Third, the optimized rules of all semantic objects are automatically assembled into a single XQuery statement that serves as the extraction rule for the complex object. Finally, the system uses an XQuery engine to execute the XQuery statement against Web pages of similar structure.
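The third step, assembling per-object XPath rules into one XQuery statement, can be illustrated with a minimal sketch. This is not the system's actual implementation: the names `FieldRule` and `assemble_xquery`, the FLWOR layout, and the sample paths are all hypothetical, and the sketch assumes the page has already been converted to well-formed XML so that an XQuery engine can read it with `doc()`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FieldRule:
    """Hypothetical holder for one semantic object's optimized rule."""
    name: str   # element name of the semantic object in the output
    xpath: str  # XPath relative to the record node


def assemble_xquery(doc_uri: str, record_name: str,
                    record_xpath: str, fields: List[FieldRule]) -> str:
    """Combine the per-object XPath rules into a single FLWOR expression."""
    lines = [
        f'for $r in doc("{doc_uri}"){record_xpath}',
        "return",
        f"  <{record_name}>",
    ]
    for field in fields:
        # Doubled braces emit the literal { } of an XQuery enclosed expression.
        lines.append(
            f"    <{field.name}>{{ data($r/{field.xpath}) }}</{field.name}>"
        )
    lines.append(f"  </{record_name}>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumed example: each table row describes one book.
    rules = [FieldRule("title", "td[1]"), FieldRule("price", "td[2]")]
    print(assemble_xquery("page.xml", "book", "//table/tr", rules))
```

Keeping each semantic object's rule as a relative XPath means a single rule can be re-induced without touching the others; only the assembled statement is regenerated.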
The approach improves robustness and efficiency and resolves the mismatch between the two kinds of structures. The output data format is flexible, since a restricted form of XML, compatible with IDL, is used as the semantic model. Because it expresses extraction rules in XQuery, the approach can be easily integrated into other Web-based applications. The extraction can, in fact, perform projection and selection operations. Experiments indicate that the approach is practical and effective.
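To make the projection and selection remark concrete, the following sketch mimics both operations on already-extracted XML using only Python's standard library. It stands in for what an XQuery `where` clause (selection) and a `return` clause (projection) would do; the sample data, the element names, and the function `select_and_project` are assumptions for illustration and do not come from the thesis.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumed sample of extracted records; not data from the thesis.
SAMPLE = """<books>
  <book><title>A</title><price>30</price></book>
  <book><title>B</title><price>12</price></book>
  <book><title>C</title><price>45</price></book>
</books>"""


def select_and_project(xml_text: str, max_price: float) -> List[str]:
    """Keep books priced at or below max_price and return only their titles."""
    root = ET.fromstring(xml_text)
    titles = []
    for book in root.findall("book"):
        price = float(book.findtext("price"))
        if price <= max_price:                      # selection: filter records
            titles.append(book.findtext("title"))   # projection: keep one field
    return titles


if __name__ == "__main__":
    print(select_and_project(SAMPLE, 30))
```

In an XQuery-based rule the same filter and field choice would be written directly into the statement, so no post-processing step outside the query engine is needed.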

Keywords/Search Tags:

XML, Information extraction, Semantic schema, Rule optimization, XQuery

Related items

1	Research On XML-Based Heterogeneous Data Sources Integration
2	Research On The Technique Of Schema-Based XQuery Optimization And Parallel Processing
3	Research On The Schema Based XQuery Optimization
4	Study Of Query Optimization Technology Based On XML Schema
5	Research On Semantic-based Extracting Schema In XML Documents
6	Semantics-based Relational Schema To Xml Schema Conversion Methods Research
7	Research On The Update Of XQuery Views
8	Optimizing Twig Query Pattern Based On XML Schema
9	Research On Ontology-Based Web Information Extraction Technology
10	Research And Implementation Of XML Query Processing Based On XQuery And Semantic Cache