Research On Web Information Extraction In Information Integration

Posted on: 2008-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: J Jiang
Full Text: PDF
GTID: 2178360212974593
Subject: Computer software and theory
Abstract/Summary:
With the explosive growth of World Wide Web content, retrieving information quickly and using it effectively has become an urgent problem. This thesis introduces the WS-IIS Information Integration System, which integrates heterogeneous data sources, Web services, and WWW content, and provides a unified view to upper-layer applications. As a supplementary part of WS-IIS, the Web information extraction subsystem provides a way to extract information from Web sites and to construct a corresponding Web Service for WS-IIS.

The thesis covers two topics: the construction of Web page extraction rules and the extraction flow. Web pages are usually described in HTML, which focuses on presentation rather than data. As a result, a program called a Wrapper is needed to extract information from the Web. The kernel of a Wrapper is its extraction rules. The DOM-based extraction method proposed in this thesis uses standard XML technologies such as XPath and XSLT to operate on Web content. An extraction experiment verified the feasibility of this method. The robustness of different extraction rules is discussed, and the key element in constructing a robust Wrapper is introduced.

Extraction rules extract information from a Web page into the destination schema, but this alone is not enough: Web information extraction still faces many new challenges. To meet the needs of information extraction and information integration in the Internet environment, the thesis introduces the information extraction flow and a flow-based Web information extraction framework. The Wrapper programmed in a specific language is replaced by an extraction flow described in XML, which is executed by a flow execution engine. On top of this engine, a framework that encapsulates each individual extraction flow as a Web Service is proposed. Because the extraction task is described in a custom extraction definition language, the flexibility and scalability of the Wrapper are increased.
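
To make the idea of an XML-described extraction flow concrete, the following is a minimal sketch and not the thesis's actual design. The abstract does not give the flow language, so the `<flow>`, `<foreach>`, and `<field>` elements, the `run_flow` function, and the sample page are all hypothetical. The sketch also assumes the page is well-formed XHTML and uses only the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module; XSLT is not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page. It is assumed to be well-formed XHTML so that
# it can be parsed directly into a DOM-like tree.
PAGE = """<html><body>
<table id="books">
  <tr><td>XPath Basics</td><td>12.50</td></tr>
  <tr><td>XSLT in Practice</td><td>30.00</td></tr>
</table>
</body></html>"""

# Hypothetical extraction flow. The element and attribute names are
# illustrative assumptions, not the language defined in the thesis.
FLOW = """<flow name="book-list">
  <foreach select=".//table[@id='books']/tr">
    <field name="title" select="td[1]"/>
    <field name="price" select="td[2]"/>
  </foreach>
</flow>"""


def run_flow(flow_xml, page_xml):
    """Execute the flow against the page and return a list of records."""
    flow = ET.fromstring(flow_xml)
    page = ET.fromstring(page_xml)
    records = []
    for loop in flow.findall("foreach"):
        # Each node matched by the loop's path becomes one record.
        for node in page.findall(loop.get("select")):
            record = {}
            for field in loop.findall("field"):
                # Field paths are evaluated relative to the matched node.
                hit = node.find(field.get("select"))
                if hit is None:
                    record[field.get("name")] = None
                else:
                    record[field.get("name")] = (hit.text or "").strip()
            records.append(record)
    return records


if __name__ == "__main__":
    for row in run_flow(FLOW, PAGE):
        print(row)
```

In this sketch the extraction rules live entirely in the XML flow, so changing what is extracted means editing data rather than wrapper code, which is the kind of flexibility the abstract attributes to a flow-based approach.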
Keywords/Search Tags: Information integration, Web information extraction, Web Service, Extraction flow