
Research on the Framework of Deep Web Integration System

Posted on: 2011-04-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Liang
GTID: 1118360305953673
Subject: Computer application technology
Abstract
Since the World Wide Web emerged in the 1990s, the Web has become the largest information platform of human society; we experience the information society through the many kinds of information conveyed by Web pages. With the development of computer science and technology, more and more information is no longer published as static Web pages but is instead "hidden", accessible only as results dynamically returned in response to user queries. Static Web pages can be crawled and indexed by traditional search engines, so that part of the Web is called the "Surface Web", while the "hidden" information that search engines cannot retrieve is called the "Deep Web". The majority of the Deep Web is stored as structured data in a large number of online databases, and the effective way to access it is to submit queries through a query interface and receive the dynamically returned results. The first white paper on the Deep Web was published by the BrightPlanet Company in 2000. Researchers were attracted by the striking characteristics of the Deep Web: higher quality, faster growth, and wider coverage. As a new object of information integration, the Deep Web is of high research and practical value.

How to effectively access useful information in the Deep Web has always been a challenging task. In the early Web, search engines were a workable and practical solution to Web information integration. The structure of the Web is a special kind of graph in which pages are the nodes and hyperlinks are the edges. Search-engine crawlers implement graph-traversal algorithms to move from one page to others, download pages, classify topics, build indexes, and provide a query-response service. However, crawlers cannot automatically fill in query interfaces to reach the information in the Deep Web, so integrating the structured information of the Deep Web cannot rely on traditional crawlers alone.

According to existing results from foreign research groups, Deep Web integration consists of three major steps: preprocessing, query processing, and results processing. During the preprocessing step, all kinds of information about the Deep Web must be analyzed and processed, such as the types of information objects, the physical characteristics of the data sources, the structural characteristics of the query interfaces, and the characteristics of query responses. The aim of this step is to generate a global query interface through which users can retrieve information from multiple Deep Web sources by issuing query instances; the major challenges are Deep Web discovery, Deep Web classification, Deep Web description, query-interface schema extraction, and schema integration. During the query processing step, the integration system should automatically choose appropriate Deep Web data sources to answer users' queries: each query instance is decomposed into sub-queries for the individual data sources, which execute the workable sub-queries to generate answers. The major challenges of this step are Deep Web selection, query translation, and query submission. During the results processing step, the integration system renders the answers from the different Deep Web sources in a uniform style, as if all the data belonged to a single source, after analyzing, extracting, de-duplicating, and annotating them. The major challenges of this step are structured data extraction, data merging, and semantic annotation.
In this paper, Deep Web integration is analyzed from the perspective of traditional data integration. Web databases are treated as a new kind of integration object; the main difference between the Deep Web and traditional databases is that the Deep Web sits behind query interfaces, which do not expose complete schema information. Facing the new challenges raised by the Deep Web, we build workable solutions on the relevant principles of traditional data integration techniques. The main research contributions are as follows:

1. A new Deep Web information integration framework, named the mediated model, is presented. It is composed of four functional processes and six functional modules. The four functional processes are data source discovery, data source classification, schema integration, and completeness and extensibility checking. The six functional modules are the global interface (global schema), the query rewriting engine, the query optimizing engine, the query executing engine, the data source indexing engine, and the results displaying engine. Together, the processes and modules can be divided into two stages: preprocessing and service. The framework starts from Web data source discovery: a focused crawling strategy and multiple classifiers are combined to discover Web data sources and produce a list of their descriptions. The Web data sources are then classified by the type of information object they contain, and the schemas of their query interfaces are extracted and classified by domain relevance. After the detailed schema information has been extracted, candidate integration schemas are generated from the domain-related schemas of the Web data sources, and after completeness and extensibility checking, a candidate global schema may become a final global schema. However, the same information object may be described in different styles and with different information, so the final global schema may not be unique; the domain information and system functions must be analyzed to choose the best global schema. The best global schema is the mediated schema, from which the global query interface is generated. Users issue query instances through the global query interface, and each query instance is preprocessed. In the query rewriting engine, the preprocessed query instance is decomposed into sub-queries according to the mappings between the mediated schema and the local schemas of the Web data sources, so that each sub-query can be executed on the related Web data source. Unlike traditional database integration, the query processing stage cannot simply be divided into query optimization and query execution: in Deep Web integration the available optimization information is limited, which may lead to two abnormal situations. First, there may not be enough information to generate a workable optimization strategy. Second, the Web data sources may not respond to the generated strategy as expected, which makes query execution inefficient. Once an executable strategy has been generated, the query executing engine sends the effective sub-queries to the related Web data sources. In this step the executable strategy depends mainly on the data source indexing engine, in which physical performance information is the major factor considered; in other words, the data source indexing engine builds a virtual Web data warehouse and also provides the location information of the Web data sources. After the results are returned from the different data sources, a results process of analyzing, extracting, de-duplicating, and annotating is carried out, and the related, selected answers from the different Web data sources are rendered to users in a uniform style, just as if the user were operating on a single data source.
2. To improve the accuracy of attribute extraction and make query interfaces semantically machine-readable, an approach is presented that extracts attributes based on heuristic information and enriches the attribute sets with an ontology to reach a deeper semantic understanding of the query interfaces. Meaningless words are instantiated with the instance information found in the query interface. During attribute extraction, each form element of a query interface is described as a three-tuple: element name, element text, and element instance. The ontology instrument WordNet is introduced to handle the semantic problems in the attribute extraction process; the instance information can be used to instantiate attributes based on the semantic relations provided by WordNet. Unlike traditional attribute processing, we try to reach a semantic understanding of the attributes in the query interface, and a query-instance method based on semantic containers can be used to reconcile conflicts between attributes. Extensive experiments over real-world domains show the utility of the algorithm in parsing interfaces and extracting valid attributes.

3. An extended schema description method for the Deep Web based on ontology techniques is presented. The query interfaces provided by the Deep Web are the clues that disclose its hidden schemas, but the complicated semantic relationships in those interfaces reduce the generality and power of the local-as-view (LAV) method used in traditional information fusion systems. To address this problem, semantic distances between attributes are calculated from the structural characteristics of the ontology instrument, the semantic relationships are evaluated through a semantic matrix, and the mapping mechanism between the mediated schema and the local schemas is built on the semantically related groups generated by an upper-triangular-matrix backtracking algorithm. By adjusting the threshold values α and β, the algorithm generates different semantically related groups. Experiments carried out on a well-known dataset verify the efficiency and the extensibility of the method.

4. An ontology learning algorithm based on incremental merging of Deep Web query interfaces is presented. The tree structure of each query interface is processed semantically and represents domain knowledge. Once the attributes have been instantiated with instances, some previously meaningless attributes convey the correct related semantic information, and no domain-unrelated attributes are generated when the concept-tree structures of the query interfaces are merged. The relation between attributes and instances is stable because the domain ontology is constructed and enriched in a consistent process.
Keywords/Search Tags: WWW, Surface Web, Deep Web, query interface, schema, mediated schema, Ontology, WordNet, Ontology Learning