Research on Web Information Extraction Technology Based on Semantics

Posted on: 2010-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: W L Huang
GTID: 2178360272479351
Subject: Computer software and theory
Abstract/Summary:
As a global information space, the Web holds tremendous intrinsic value, and extracting exactly the information a user needs from its complex data has become an important problem. Although a great deal of research has been carried out on web data extraction, existing techniques lack a description of the data itself, carry no explicit semantic information, and use extraction patterns that are not specific enough. They therefore adapt poorly to the diversity of structure and pattern on the Web, so application programs cannot directly analyze and use the mass of information there, and much of it is wasted. This thesis applies suffix tree technology to the data characteristics of a knowledge-intensive web site to extract the available data patterns, builds a domain ontology with the Protégé tool to add semantic information during extraction, and uses a semantics-based ontology construction method to eliminate the heterogeneity among the site's homogeneous information sources. It focuses on the implementation techniques for extracting data from knowledge-intensive web sites, based on an overall solution that combines ontology technology with semi-structured web information extraction.
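To illustrate the idea of finding a repeated data pattern in a page's tag sequence, here is a minimal Python sketch. It is not the thesis's implementation: instead of a real suffix tree it sorts all suffixes naively (a simple suffix-array stand-in, quadratic or worse in cost), and the function name `longest_repeated_run` and the sample tag list are assumptions made for the example.

```python
def longest_repeated_run(tokens):
    """Return the longest contiguous token run that occurs at least twice.

    Naive stand-in for a suffix tree: sort every suffix, then compare
    neighbouring suffixes, because the longest repeated run is always a
    common prefix of two suffixes that are adjacent in sorted order.
    """
    starts = sorted(range(len(tokens)), key=lambda i: tokens[i:])
    best = []
    for a, b in zip(starts, starts[1:]):
        n = 0  # length of the common prefix of the two suffixes
        while (a + n < len(tokens) and b + n < len(tokens)
               and tokens[a + n] == tokens[b + n]):
            n += 1
        if n > len(best):
            best = tokens[a:a + n]
    return best


# Hypothetical tag sequence of a two-row table, with text content dropped.
tags = ["table",
        "tr", "td", "/td", "td", "/td", "/tr",
        "tr", "td", "/td", "td", "/td", "/tr",
        "/table"]
row_pattern = longest_repeated_run(tags)
print(row_pattern)
```

On this input the repeated run is the row template, which is the kind of structural pattern an extraction rule could be built from. A real suffix tree would find the same repeats in linear time.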
Based on an analysis of the fundamental principles, techniques, and development status of conventional information extraction methods, this thesis proposes an ontology-driven information extraction model that locates information through document structure and feature matching, and it details the design of the model and the extraction workflow. The system addresses heterogeneity among web documents as follows. It first fetches the specified HTML page and converts it into a well-formed XML document with a conversion algorithm based on a stack structure and a link structure. It then extracts data patterns from the XML document with suffix tree technology, adds semantic information to the extracted data through the ontology construction method, formally describes the domain ontology in OWL (Web Ontology Language), and generates an extraction rule base, which completes the transition from extracted data to an RDF data model that contains semantic information. In short, the thesis attaches semantic information through the ontology and extracts the data patterns of the web site structure with suffix tree technology. The work achieves pattern extraction for the information sources of a knowledge-intensive web site, which helps users discover valuable information resources on the Web and provides an effective tool for exploiting its mass data.
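The HTML-to-XML conversion step can be sketched with a stack of open tags. The Python example below is a minimal illustration under stated assumptions, not the thesis's algorithm: it covers only the stack-based part (the link structure is not modelled), the class name `StackXmlConverter`, the `VOID` tag set, and the `<document>` wrapper element are hypothetical, and attribute names are assumed to be valid XML names already.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Assumed subset of HTML elements that never take a closing tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class StackXmlConverter(HTMLParser):
    """Emit XML while tracking open tags on a stack."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # currently open element names
        self.out = []    # pieces of the XML output

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")  # self-close void tags
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag with no opener: drop it
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")  # close anything left open inside
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def html_to_xml(html):
    conv = StackXmlConverter()
    conv.feed(html)
    conv.close()
    while conv.stack:  # close tags still open at end of input
        conv.out.append(f"</{conv.stack.pop()}>")
    # A single wrapper element guarantees exactly one XML root.
    return "<document>" + "".join(conv.out) + "</document>"


xml_text = html_to_xml("<ul><li>One<li>Two &amp; three<br></ul>")
root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
print(xml_text)
```

The sketch does not apply HTML's implied end-tag rules, so the second `li` ends up nested inside the first. The result is still well-formed, which is all the later pattern extraction step requires of it.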
Keywords/Search Tags: Suffix tree, Semantics, Web information extraction, Ontology, XML