Research On Web Information Extraction In Information Integration

Posted on:2008-07-22

Degree:Master

Type:Thesis

Country:China

Candidate:J Jiang

Full Text:PDF

GTID:2178360212974593

Subject:Computer software and theory

Abstract/Summary:

With the explosive growth of World Wide Web content, retrieving information quickly and using it effectively has become an urgent problem. This thesis introduces WS-IIS, an information integration system that integrates heterogeneous data sources, Web services, and WWW content, and provides a unified view to upper-layer applications. As a supplementary part of WS-IIS, the Web information extraction subsystem provides a way to extract information from Web sites and to construct corresponding Web Services for WS-IIS. The thesis covers two topics: the construction of Web page extraction rules, and the extraction flow. Web pages are usually written in HTML, which focuses on presentation rather than data. As a result, a program called a Wrapper is needed to extract information from the Web, and the kernel of a Wrapper is its extraction rules. The DOM-based extraction method proposed in this thesis uses standard XML technologies such as XPath and XSLT to operate on Web content. An extraction experiment verified the feasibility of this method. The robustness of different extraction rules is discussed, and the key elements of constructing a robust Wrapper are introduced. Extraction rules map the information in a Web page into the destination schema, but this alone is not enough: Web information extraction still faces many new challenges. To meet the needs of information extraction and information integration in the Internet environment, the thesis introduces the information extraction flow and a flow-based Web information extraction framework. The Wrapper programmed in a specific language is replaced by an extraction flow described in XML, which is executed by a flow execution engine. On the basis of this engine, a framework that encapsulates each individual extraction flow as a Web Service is proposed. Because the extraction task is described in a custom extraction definition language, the flexibility and scalability of the Wrapper are increased.
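To make the two ideas in the abstract concrete, the following is a minimal sketch of an XML-described extraction flow whose steps are XPath rules, run by a very small execution engine. It is not the thesis's actual extraction definition language or engine: the flow vocabulary (`flow`, `foreach`, `field`, `select`), the sample page, and the `run_flow` function are all hypothetical names invented for illustration. The sketch also assumes the page has already been tidied into well-formed XHTML, and it relies on Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath and no XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, assumed to be already cleaned into well-formed XHTML.
PAGE = """
<html><body>
  <table id="books">
    <tr class="item"><td class="title">XML in Practice</td><td class="price">30</td></tr>
    <tr class="item"><td class="title">Web Wrappers</td><td class="price">45</td></tr>
  </table>
</body></html>
"""

# Hypothetical extraction flow. The element and attribute names are invented
# for this sketch; each "select" attribute holds an XPath extraction rule.
FLOW = """
<flow name="book-list">
  <foreach select=".//table[@id='books']/tr[@class='item']">
    <field name="title" select="td[@class='title']"/>
    <field name="price" select="td[@class='price']"/>
  </foreach>
</flow>
"""


def run_flow(flow_xml, page_xml):
    """Execute the flow against the page and return a list of records."""
    flow = ET.fromstring(flow_xml.strip())
    page = ET.fromstring(page_xml.strip())
    records = []
    for loop in flow.findall("foreach"):
        # Each node matched by the loop's rule becomes one output record.
        for node in page.findall(loop.get("select")):
            record = {}
            for field in loop.findall("field"):
                # Field rules are evaluated relative to the matched node.
                target = node.find(field.get("select"))
                if target is None:
                    record[field.get("name")] = None
                else:
                    record[field.get("name")] = (target.text or "").strip()
            records.append(record)
    return records


if __name__ == "__main__":
    for record in run_flow(FLOW, PAGE):
        print(record)
```

Because the rules live in the flow document rather than in program code, a change in page layout would, under these assumptions, only require editing the `select` expressions. Rules anchored on attributes such as `id` or `class` also tend to survive layout changes better than purely positional paths, which is one common reading of what makes an extraction rule robust.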

Keywords/Search Tags:

Information integration, Web information extraction, Web Service, Extraction flow

Related items

1	Research On Web Information Extraction For Domain In Information Integration System
2	Research On Key Issues Of Web Information Integration Oriented Web Information Extraction
3	Design And Implementation Of Web Information Extraction Rules
4	Automatic Extraction And Integration For Academic Achievement
5	Research And Implement Of Web Information Extraction Based On XML Elements Processed
6	Web Information Extraction And Integration Research Based On XML
7	The Design And Implementation Of Web Information Extraction System
8	Adaptive Web Information Extraction Method Research Based On Ontology
9	The Research And Design On Information Extraction And Gathering Model Based On XML
10	Research Of Web Information Extraction & Practice Based On Web Service