Research On Web Information Extraction For Domain In Information Integration System

Posted on:2009-12-29

Degree:Master

Type:Thesis

Country:China

Candidate:H Liu

Full Text:PDF

GTID:2178360272978140

Subject:Computer software and theory

Abstract/Summary:

With the explosion of World Wide Web content, retrieving information correctly and quickly and using it effectively has become an urgent problem. In an Information Integration System (IIS), the Web is the broadest, largest, and most up-to-date data source, so the central problem for a Web information extraction system is how to integrate heterogeneous data sources and provide the upper layer with a unified data service interface. This thesis covers two parts: the construction of Web page extraction rules and the extraction framework system. It proposes and implements a framework for extracting domain information that adaptively combines DOM-based and NLP-based methods.

The kernel of a wrapper is its extraction rules. The DOM-based method proposed in this thesis uses standard XML technology to operate on Web content: rules are generated by inductive learning, a rule parser executes them, and the information items are extracted. Extraction experiments verified the feasibility of this method. However, the method may not be valid for Web pages that are not data-oriented in style, so an NLP-based extraction method is also proposed. Drawing on recent research results in NLP, the data sources are combined with the tags in the Web pages and preprocessed by word segmentation and classification. Using an event-trigger pattern, the semantic distance of the information items to be extracted is calculated. Extraction experiments verified the feasibility of this method as well. The NLP-based method compensates for the shortcomings of the DOM mapping.

In this system, the data sources are preprocessed, and coarse content blocks are detected and extracted using information entropy theory. Domain ontologies are used for description at the bottom layer and are mapped to the upper-layer decision information, which makes it convenient to switch domains. The extraction results are saved in a database, together with the extracted information ontology set, for further use by the other modules of the IIS. Experimental results on domain Web pages show the correctness of the extraction algorithms and the validity and usability of the system framework, which has prospects for further research and commercial application.
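
The DOM-based step described above (a rule parser applying extraction rules to a page treated as XML) can be illustrated with a minimal sketch. This is not the thesis's implementation: the rule format, the field names, the `extract` function, and the sample page are all hypothetical, the rules are written by hand rather than learned by induction, and the page is assumed to be well-formed XHTML so that Python's standard `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction rules: field name -> path expression.
# In the thesis the rules come from inductive learning; here they are
# hand-written purely for illustration.
RULES = {
    "title": ".//div[@class='item']/h2",
    "price": ".//div[@class='item']/span[@class='price']",
}


def extract(xhtml, rules):
    """Apply each rule to the parsed page and collect the matched text."""
    root = ET.fromstring(xhtml)  # assumes well-formed XHTML input
    result = {}
    for field, path in rules.items():
        # findall() supports a limited XPath subset, including
        # attribute predicates such as [@class='item'].
        result[field] = [(node.text or "").strip() for node in root.findall(path)]
    return result


# Hypothetical sample page with two data records and one unrelated block.
PAGE = """<html><body>
<div class="item"><h2>Widget A</h2><span class="price">10.50</span></div>
<div class="item"><h2>Widget B</h2><span class="price">7.25</span></div>
<div class="ad"><h2>Sponsored</h2></div>
</body></html>"""

records = extract(PAGE, RULES)
print(records)
```

Because the rules select only nodes under `div` elements with class `item`, the unrelated `ad` block is ignored. Real Web pages are often not well-formed, so a practical system would need an HTML-tolerant parser or a cleanup pass before applying rules of this kind.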

Keywords/Search Tags:

Information Integration, Web Information Extraction, Extraction Rules, Extract Framework

Related items

1	Research On Web Information Extraction Framework
2	Design And Implementation Of Web Information Extraction Rules
3	Agent-based Web Information Extraction
4	Research On Language And Key Techniques For Accurate Information Extraction Rules Towards Complex Web
5	Research On Key Issues Of Web Information Integration Oriented Web Information Extraction
6	Research On Automated Web Navigation And Data Integration Rules For Web Information Extraction
7	Research On Competitive Information Extraction Based On Web
8	The Information Extraction Of Unstructured Document Extraction And Analysis
9	Optimizing Of Extraction Rules And Expressing Of The Rules With XQuery In Web Information Extraction Systems
10	XML-based WEB Information Extraction System Research And Implementation