Research On Web Information Extraction For Domain In Information Integration System

Posted on: 2009-12-29
Degree: Master
Type: Thesis
Country: China
Candidate: H Liu
GTID: 2178360272978140
Subject: Computer software and theory
Abstract/Summary:
With the explosive growth of World Wide Web content, retrieving information correctly and quickly, and using it effectively, has become an urgent problem. In an Information Integration System (IIS), the most pressing problem for the Web information extraction system is how to integrate heterogeneous data sources, of which the Web is the broadest, largest, and most real-time, and how to provide the upper layer with a unified data service interface. This thesis covers two parts: the construction of Web page extraction rules and the extraction system framework. It proposes and implements an information extraction framework for domain information that adaptively combines DOM-based and NLP-based methods. Extraction rules are the kernel of a wrapper. The DOM-based method proposed here uses standard XML technology to operate on Web content: rules are generated by inductive learning, and a rule parser then executes them to extract the information items. An extraction experiment verified the feasibility of this method. However, this method may not be valid for Web pages that do not follow a data-guided style, so an NLP-based extraction method is also proposed. Drawing on recent research results in NLP, the data sources are combined with the tags in Web pages and preprocessed by word segmentation and classification. Using an event-trigger pattern, the semantic distance of the information items to be extracted is calculated. An extraction experiment verified the feasibility of this method as well. The NLP extraction method compensates for the shortcomings of the DOM mapping. In this system, the data sources are preprocessed, and coarse blocks are detected and extracted using information entropy theory. Domain ontologies are used for description at the bottom layer and are mapped to the upper-layer decision information, which makes it convenient to change domains.
The extraction results are saved in a database, together with the extracted information ontology set, for further use by the other modules of the IIS. Extraction experiments on Web pages containing domain information show the correctness of the extraction algorithms and the validity and usability of the system framework, which has prospects for further research and commercial application.
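The DOM-based step described in the abstract (a rule parser applying extraction rules to a page tree) can be illustrated with a minimal sketch. This is not the thesis's implementation: the rule format, the `extract` function, the field names, and the sample page are all hypothetical, and the rules are written by hand here, whereas the thesis generates them by inductive learning. The sketch also assumes the page is well-formed XML, which real Web pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written rules. "record" locates each repeated data block;
# "fields" maps an output field name to a path relative to that block.
RULES = {
    "record": ".//div[@class='item']",
    "fields": {
        "title": "./h2",
        "price": "./span[@class='price']",
    },
}


def extract(page_xml, rules):
    """Apply the rules to a well-formed page and return one dict per record."""
    root = ET.fromstring(page_xml)
    records = []
    for node in root.findall(rules["record"]):
        record = {}
        for name, path in rules["fields"].items():
            match = node.find(path)
            # A field that is missing or has no text is reported as None.
            if match is not None and match.text:
                record[name] = match.text.strip()
            else:
                record[name] = None
        records.append(record)
    return records


# Hypothetical sample page: two data records and one unrelated block.
PAGE = """<html><body>
<div class="item"><h2>Widget A</h2><span class="price">10.50</span></div>
<div class="item"><h2>Widget B</h2></div>
<div class="ad"><h2>Buy now</h2></div>
</body></html>"""

if __name__ == "__main__":
    for row in extract(PAGE, RULES):
        print(row)
```

Keeping the rules as data, separate from the code that executes them, reflects the abstract's point that the rules are the kernel of the wrapper: adapting to a different page layout would mean changing the rule set, not the parser.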
Keywords/Search Tags: Information Integration, Web Information Extraction, Extraction Rules, Extract Framework