
Optimizing Of Extraction Rules And Expressing Of The Rules With XQuery In Web Information Extraction Systems

Posted on: 2004-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: S F Chen
Full Text: PDF
GTID: 2168360122461102
Subject: Computer application technology
Abstract/Summary:
As the Internet develops rapidly, the World Wide Web has become the most important and promising information resource for the global distribution and sharing of scientific, educational, commercial, and entertainment information. Web documents marked up in HTML are designed for presentation and lack schema and semantic information, so XML and related specifications have emerged to address these problems and to support efficient management and retrieval of Web information. XML separates semantic information from presentation and has become the standard for information exchange. However, most valuable Web information is, and will remain, in HTML form. Therefore, to access Web information in a structured and uniform way, people apply information extraction technology to the Web.

In this paper, we present an improved Web information extraction approach based on an analysis of the factors that affect precision and recall. After studying typical systems and carefully considering the foundations of extraction and the mismatches between the semantic schema structure and the Web page structure, the approach introduces three levels of correlated extraction rules: the raw rule, the optimized rule, and the XQuery-based rule. First, the system generates the raw rule, composed of rule segments, with the help of the user. Second, the system induces from the raw rule an optimized rule, expressed in standard XPath, for every semantic object in the semantic schema; during this step, negative instances and the mismatches between the Web page structure and the user-defined semantic schema structure are taken into account to improve performance. Third, the optimized rules of all semantic objects are automatically assembled into a single XQuery statement that serves as the extraction rule for the complex object. Finally, the system uses an XQuery engine to execute the XQuery statement on Web pages of similar structure.

The approach improves robustness and efficiency and resolves the mismatch between the two kinds of structures. The output data format is flexible, since a limited form of XML, compatible with IDL, is used as the semantic model. Because the extraction rule is expressed in XQuery, the approach can be easily integrated into other Web-based applications. The extraction can, in effect, perform projection and selection operations. Experiments indicate that our approach is practical and effective.
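The third step described above, assembling the per-object XPath rules into a single XQuery statement, might look roughly like the following Python sketch. It is not the thesis's actual implementation: the function name `assemble_xquery`, the tag names `results` and `record`, and the sample rules and file name are all hypothetical, chosen only to illustrate how relative XPath rules could be combined into one FLWOR expression.

```python
from typing import Dict


def assemble_xquery(doc_uri: str, record_path: str, field_rules: Dict[str, str],
                    root_tag: str = "results", record_tag: str = "record") -> str:
    """Combine per-field XPath rules into one XQuery FLWOR expression.

    record_path is an absolute XPath selecting each repeated record node;
    every value in field_rules is an XPath relative to that node.
    (Hypothetical helper; all names here are illustrative assumptions.)
    """
    # One constructed element per semantic object, filled by its XPath rule.
    fields = "\n".join(
        f"    <{name}>{{ data($r/{path}) }}</{name}>"
        for name, path in field_rules.items()
    )
    return (
        f"<{root_tag}>{{\n"
        f"  for $r in doc(\"{doc_uri}\"){record_path}\n"
        f"  return <{record_tag}>\n"
        f"{fields}\n"
        f"  </{record_tag}>\n"
        f"}}</{root_tag}>"
    )


# Hypothetical optimized rules for two semantic objects of a book listing.
rules = {"title": "td[1]/a", "price": "td[2]"}
query = assemble_xquery("books.html", "/html/body/table/tr", rules)
print(query)
```

The generated statement would still need an XQuery engine to run, and the target page would have to be well-formed (for example, HTML tidied into XHTML) for `doc()` to load it. Choosing which fields appear in `field_rules` corresponds to projection; a `where` clause could be added to the generated FLWOR expression for selection.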
Keywords/Search Tags: XML, Information extraction, Semantic schema, Rule optimization, XQuery