Research on Key Issues of Web Information Integration-Oriented Web Information Extraction

Posted on: 2008-06-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Teng
Full Text: PDF
GTID: 1118360215976787
Subject: Computer applications
Abstract/Summary:
The Internet presents a stunning variety of online information resources: recruitment information, telephone directories, product catalogs, weather forecasts, and many more. The amount of data accessible via the Web is staggeringly large and growing rapidly. However, Web information is usually formatted for human readers rather than machines, and no provision is made for automating its processing. That is, the Web's browsing paradigm does not support many information management tasks.
In this dissertation, we discuss the key issues in research on this topic.
Chapter II presents a new lightweight wrapper that supports a highly efficient wrapping process. The modeling of semi-structured data has been investigated in recent years; in particular, ad hoc problems in modeling and querying semi-structured data have been studied extensively. However, modeling suited to real-time integration has received little attention. Existing induction algorithms and training example collections work at page granularity, which requires the user to label an example page first, and every tuple on that page has to be labeled. This places a tedious burden on the user and lowers the efficiency of the system. The algorithm we propose learns a wrapper by generalizing from simple example query responses. It trains on only one page of data and takes the tuple as the granularity of the learning algorithm.
In Web information integration, information may come from different Web sites, so we should analyze not only the content of the Web page itself but also the authority of the different information sources. Since Web information resides not only in labels but also in nodes, the approach is based on a rooted, labeled graph with objects as nodes and labels on edges. In this way, we can integrate the most valuable information into the result.
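To make the notion of source authority concrete, the following is a minimal sketch of the standard HITS iteration, which scores pages from link structure alone. It is not the pseudo-parallel variant developed in the dissertation; the function names `hits` and `_normalize`, the dictionary-based link representation, and the fixed iteration count are all assumptions made for illustration.

```python
from math import sqrt


def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all scores are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Standard HITS on a link graph.

    links: dict mapping a page to the list of pages it links to
           (hypothetical representation, chosen for brevity).
    Returns (authority, hub) dictionaries keyed by page.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return auth, hub


if __name__ == "__main__":
    # Two pages pointing at the same target: the target gets the authority.
    authority, hub_scores = hits({"a": ["c"], "b": ["c"], "c": []})
    print(sorted(authority.items()))
    print(sorted(hub_scores.items()))
```

In an integration setting, authority scores of this kind could be used to prefer values from better-linked sources when several sites report conflicting data.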
In Chapter III, we present a fast authority evaluation algorithm, the pseudo-parallel authority evaluation algorithm, which is based on HITS. It expands the link topology only once instead of expanding it to some extent, so that the Authority and Hub values can be calculated while the contents of the Web pages are being extracted.
Other main problems relate to the identification of semantically related information, that is, information describing the same real-world concept in different sources, and to semantic heterogeneity. The information sources available in global information systems already exist and have been developed independently. Consequently, semantic heterogeneity can arise in the terminology, structure, and context of the information, and it has to be dealt with properly during integration in order to exploit the information available at the sources effectively and correctly.
Developing intelligent tools for integrating information extracted from multiple heterogeneous sources is a challenging step toward effectively exploiting the numerous sources available online in global, Internet-based information systems. In Chapter IV, we describe our HowNet-based ontology environment. We discuss several issues that must be overcome before ontologies can become practical. We present an inclusion model for ontologies that enables users to assemble new ontologies rapidly from existing ones in a repository. This model makes a clean separation between its simple formal semantics and the input/output properties of the system that uses it. The formal model handles simple inclusion, polymorphic refinement, restrictions, and circular inclusion dependencies. The input/output model yields succinct, readable external representations and is transparent to users.
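As an illustration of why circular inclusion dependencies need explicit handling, the sketch below resolves the set of ontologies that a new ontology pulls in from a repository. It covers only simple inclusion, not polymorphic refinement or restrictions, and it is not the dissertation's formal model; the function name `resolve_inclusions`, the dictionary-based repository, and the ontology names are hypothetical.

```python
def resolve_inclusions(includes, root):
    """List every ontology reachable from `root` through inclusion.

    includes: dict mapping an ontology name to the names it includes
              (hypothetical repository layout).
    The `seen` set guarantees termination even when inclusions are circular.
    """
    order = []
    seen = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already resolved; this also breaks inclusion cycles
        seen.add(name)
        order.append(name)
        # Push in reverse so that inclusions are visited in their listed order.
        stack.extend(reversed(includes.get(name, [])))
    return order


if __name__ == "__main__":
    # "core" includes "chem" again, forming a cycle (names are illustrative).
    repository = {
        "chem": ["units", "core"],
        "units": ["core"],
        "core": ["chem"],
    }
    print(resolve_inclusions(repository, "chem"))
```

Tracking visited names rather than recursing blindly is the design choice that lets a cycle be treated as ordinary shared inclusion instead of an error.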
We also describe an infrastructure that enables the Ontology Server to provide access over the World Wide Web, so that users can create, edit, view, and export ontologies using their own Web browsers.
In Chapter V, we describe some other issues of information integration. The kernel of the problem is a fast and flexible algorithm that integrates heterogeneous information resources and finally constructs the result dataset. To this end, we present a scheme for integrating fuzzy numbers that correspond to static knowledge and dynamic information. The scheme, called the compatibility integration scheme, is content dependent and is based on the concept of compatibility between existing knowledge and information. It is motivated by the hypothesis that information serves to support existing knowledge when it is consistent with it, and to modify and update existing knowledge when the two differ substantially.
Intelligent information integration, and in particular the extraction of information from multiple heterogeneous sources, is a challenging issue in effectively exploiting the numerous sources available online in global information systems. Finally, we propose a prototype of an intelligent, tool-supported information extraction and integration system. Information integration is performed semi-automatically by exploiting the knowledge in the Common Thesaurus and the HowNet descriptions of source schemas with a combination of clustering techniques and Description Logics. This integration process gives rise to a virtual integrated view of the underlying sources, for which mapping rules and integrity constraints are specified to handle heterogeneity.
We had the honor of applying the theory in this dissertation to the corporate competitive information system of a global chemical company, which verified the value of our theory.
Keywords/Search Tags:Information extraction, Information integration, Semi-structured data, Semantic heterogeneity, Description Logics, Clustering techniques