Design And Implementation Of Web Information Extraction System SEU-WIE

Posted on: 2007-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yu
Full Text: PDF
GTID: 2178360212465628
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, the World Wide Web has become a huge space of distributed information. However, users cannot quickly find the information they need because of the Internet's inherent properties: it is open, dynamic, and heterogeneous. Retrieving the needed information quickly and accurately from this huge information resource has therefore become a difficult problem. Various Web information extraction technologies have emerged as solutions, but all of them have limitations in application.

This thesis studies Web information extraction technology, analyzes the requirements of the project, and then presents the design and implementation of SEU-WIE, a Web information extraction system that we developed. The system separates the definition of extraction rules from their execution and has a user-friendly interface. It is general and flexible. The system consists of two parts: the definition of the extraction rules and the execution of the extraction rules. For the rule definition phase, the thesis first describes how to transform data represented in HTML into a well-formed XML document and how to obtain the DOM tree of that document. The user then specifies the location of the information to be extracted and maps it to the target table, which defines the extraction rule. In the rule execution phase, the system first locates the data block using the XPath expression in the user-defined extraction rule, then obtains the ontology information and extracts the data with the IEOntoMatch algorithm. Finally, it stores the extracted data in a structured form.

The thesis also presents research on data pre-processing. Data extracted from the Web has a variety of quality problems, so it must be cleaned, transformed, integrated, and so on.
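The two phases described above can be illustrated with a minimal Python sketch. It is not the SEU-WIE implementation: the rule format (`RULE`), the helper names (`HtmlToTree`, `html_to_tree`, `extract`), and the sample page are all assumptions made for illustration, and the ontology matching step (IEOntoMatch) is left out because the abstract does not specify it. The sketch only shows how loosely written HTML might be turned into a well-formed tree and how a user-defined rule with path expressions might map a data block to the columns of a target table.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never have a closing tag in HTML (partial list, enough for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlToTree(HTMLParser):
    """Hypothetical helper: builds a well-formed element tree from sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        node = self.stack[-1]
        if len(node):  # text after a child element belongs to that child's tail
            node[-1].tail = (node[-1].tail or "") + text
        else:
            node.text = (node.text or "") + text


def html_to_tree(html):
    """Parse HTML and return the root of a well-formed tree."""
    parser = HtmlToTree()
    parser.feed(html)
    parser.close()
    return parser.root


def extract(root, rule):
    """Apply an assumed rule format: one path for the data block, one per column."""
    rows = []
    for block in root.findall(rule["block"]):
        row = {}
        for column, path in rule["fields"].items():
            cell = block.find(path)
            row[column] = "".join(cell.itertext()).strip() if cell is not None else None
        rows.append(row)
    return rows


# Sample page with an unclosed <br> and a missing </tr>, as is common in real HTML.
SAMPLE_HTML = """
<html><body>
<table id="books">
<tr><td>XML Basics<br></td><td>12.50</td></tr>
<tr><td>DOM in Practice</td><td>30.00</td>
</table>
</body></html>
"""

# Hypothetical extraction rule: block location plus column-to-path mapping.
RULE = {
    "block": ".//table[@id='books']/tr",
    "fields": {"title": "td[1]", "price": "td[2]"},
}

root = html_to_tree(SAMPLE_HTML)
rows = extract(root, RULE)
print(rows)
```

Keeping the rule as plain data, separate from the `extract` function, mirrors the separation of rule definition and rule execution that the abstract describes. The standard library's `ElementTree` supports only a subset of XPath, so a full system would likely need a complete XPath engine.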
Keywords/Search Tags: Web Information Extraction, Extraction rules, XML, DOM, ontology, IEOntoMatch, data pre-processing