
Research On Key Technologies Of Deep Web Data Integration

Posted on: 2011-08-16    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Wang    Full Text: PDF
GTID: 1118360305953689    Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Web, more and more information has been moved from static web pages (the Surface Web) into web databases (the Deep Web) managed by web servers. Public information on the Deep Web is estimated to be 400 to 500 times larger than that of the Surface Web; compared with the Surface Web, the Deep Web not only contains higher-quality information but also constitutes the fastest-growing carrier of web data. Accelerating research on the Deep Web has therefore become an urgent task.

The purpose of integrating Deep Web databases is to make the fullest possible use of the information held in a multitude of online databases. In domain-oriented information retrieval, relevant domain knowledge generally helps to improve search results. Ontology, as a foundation of knowledge processing and a modeling tool that describes concepts and their relationships at the semantic level, is widely used in information retrieval. This thesis therefore studies Deep Web search techniques by systematically employing ontologies. The main contributions are as follows:

(1) Construction of a domain ontology: An ontology uses a hierarchical tree structure to describe the logical relationships between concepts, so that user queries and relevant data can be mapped onto the concept model. A domain ontology can be seen as a knowledge system that describes domain-specific concepts and their relationships. Because different subject areas and projects have different requirements, the process of constructing a domain ontology varies. Among the most influential guidelines are the five criteria proposed by T. R. Gruber in 1995: clarity, coherence, extendibility, minimal encoding bias, and minimal ontological commitment. To improve the efficiency of ontology construction while ensuring the quality of the resulting domain ontology to some extent, this thesis proposes a semi-automatic construction method.
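As a rough illustration of the term-to-concept matching on which semi-automatic expansion rests, the sketch below attaches candidate domain terms to their best-matching core-ontology concept when a simple string-similarity score clears a threshold. The concept names, terms, and the 0.6 threshold are all invented for illustration; a real system would use richer semantic matching.

```python
# Hypothetical sketch: match domain terms to ontology concepts by surface
# similarity and attach the ones that pass a threshold. Unmatched terms
# would be left for manual review by the ontology engineer.
from difflib import SequenceMatcher

def similarity(term: str, concept: str) -> float:
    """Surface similarity between a term and a concept label (0..1)."""
    return SequenceMatcher(None, term.lower(), concept.lower()).ratio()

def expand_ontology(core: dict, terms: list, threshold: float = 0.6) -> dict:
    """Attach each term to its best-matching concept if the score passes
    the threshold (core maps concept name -> list of attached terms)."""
    for term in terms:
        best, score = max(((c, similarity(term, c)) for c in core),
                          key=lambda pair: pair[1])
        if score >= threshold and term not in core[best]:
            core[best].append(term)
    return core

core = {"Vehicle": ["car"], "Price": []}
expanded = expand_ontology(core, ["vehicles", "pricing"])
# "vehicles" lands under Vehicle, "pricing" under Price
```

The threshold trades precision against recall of the expansion; terms below it stay out of the ontology rather than polluting it, which matches the "semi-automatic" framing of keeping a human in the loop.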
First, a core ontology is built with an ontology tool; then the semantic relationships between domain terms and ontology concepts are judged by a matching method, and matching concepts are added to the core ontology to complete the expansion automatically. Finally, the domain ontology is represented in the Web Ontology Language (OWL), so that operating on the ontology is equivalent to operating on the OWL file.

(2) Discovery of Deep Web entry points: The quality and quantity of the discovered databases directly determine the quality of Deep Web data integration. In the Deep Web, a significant amount of information can only be accessed through the query interface of a back-end database; the query interface is the door to the information in a web database, so discovering Deep Web entry points reduces to the problem of recognizing query interfaces. Traditional crawlers visit web pages breadth-first and consequently download a large number of irrelevant pages, at high cost and low efficiency. This thesis therefore introduces focused crawling and ontology to construct a web page classifier, a form structure classifier, and a form content classifier, which together visit only those pages likely to lead to Deep Web query interfaces. This avoids following useless links and implements automatic discovery of domain-specific Deep Web entry points.

a) Web page classifier: an ontology-based focused crawler used to discover topic-relevant pages. Its key component is a domain-ontology knowledge base, essentially a set of topic-knowledge objects. During focused crawling, the classifier computes the similarity between each page and the domain topic to filter out useless pages.

b) Form structure classifier: parses the relevant pages, judges with a decision tree algorithm whether they contain query forms, and adds the pages that do to a form database.
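The thesis trains a decision tree for the form structure classifier; as a minimal stand-in, the rules below are fixed by hand purely to illustrate the kind of form features such a tree would split on (control counts, presence of a password field, presence of a submit control). The feature set and thresholds are assumptions, not the dissertation's learned model.

```python
# Hand-written illustration of a form structure classifier: decide whether
# a parsed <form> looks like a Deep Web query interface from a few
# structural features. A real classifier would learn these splits from
# labeled forms with a decision tree algorithm.
from dataclasses import dataclass

@dataclass
class FormFeatures:
    n_text_inputs: int   # free-text boxes in the form
    n_selects: int       # drop-down lists in the form
    has_password: bool   # password fields signal login, not search
    has_submit: bool     # a form without a submit control is inert

def is_query_form(f: FormFeatures) -> bool:
    """Judge whether a form is plausibly a searchable query interface."""
    if f.has_password or not f.has_submit:
        return False     # login/registration form, or not submittable
    # Query interfaces typically expose several searchable controls.
    return (f.n_text_inputs + f.n_selects) >= 2

print(is_query_form(FormFeatures(2, 1, False, True)))  # search-style form
print(is_query_form(FormFeatures(1, 0, True, True)))   # login-style form
```

The point of the sketch is the feature engineering: distinguishing query forms from login, registration, and subscription forms is exactly the discrimination the decision tree must learn.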
Whether a real database stands behind a form is directly reflected in the output of the form structure classifier.

c) Form content classifier: recognizes domain-specific Deep Web query entry points at the semantic level and stores the URLs of the domain-specific query interfaces in a database, to be called on by the other modules.

(3) Schema extraction from Deep Web query interfaces: Observation of HTML pages shows that web designers usually combine a number of visual features, such as position, layout, and appearance, to make a form's structure clearer; these features are very important for associating attribute labels with query controls. This thesis therefore proposes an algorithm that automatically extracts interface schemas based on the visual features of pages. The extraction process is as follows:

a) Parse the query interface region and obtain its form elements. The region is analyzed by calling an HTML parser with a self-defined parsing procedure. During parsing, four categories of information are captured: start tags, end tags, simple tags, and text tags. Information outside these four categories is filtered out automatically, leaving the form elements of the query interface region.

b) Attribute extraction: While parsing the query interface region, a name, label, type, and value are assigned to each categorized tag. However, designers often leave no information between a visual tag and its end tag, purely for layout purposes, so blank lines and rows that carry no information must be discarded. In addition, nodes of no concern to users, such as "submit", "reset", and "link", should also be discarded to improve extraction accuracy.

c) Attribute analysis: Although attribute extraction has already discarded some useless nodes, the extracted information still cannot express the semantics of the query interface. Different query interfaces have different position, layout, and appearance features; attribute analysis makes full use of these visual features to associate attribute labels with query controls and generate the logical attributes of each query interface.

(4) Integration of Deep Web query interfaces: Interface integration provides unified access to disparate relevant sources. Attribute analysis is the most important channel for integrating query interfaces, and the integration can be generated automatically by exploring the matching relationships in the schema and semantic information of the different query interfaces. This thesis proposes a novel ontology-based method of integrating query interfaces, comprising two parts: schema matching and schema merging.

a) Schema matching: To improve matching performance, the concept-mapping technology of the domain ontology is used during attribute analysis to discover the semantic relationships among the attributes of different web interfaces.
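A hedged sketch of this ontology-backed schema matching: attribute labels from two query interfaces are normalized to shared concepts through a synonym map, so that semantically equal fields line up. The synonym table and attribute names below are invented for illustration; a real system would derive the mapping from the domain ontology.

```python
# Illustrative concept mapping: normalize each interface attribute label to
# an ontology concept, then pair attributes from two interfaces that map to
# the same concept. The SYNONYMS table stands in for the domain ontology.
SYNONYMS = {
    "author": "Author", "writer": "Author",
    "title": "Title", "book title": "Title",
    "price": "Price", "cost": "Price",
}

def to_concept(label: str):
    """Map a raw attribute label to its ontology concept, if known."""
    return SYNONYMS.get(label.strip().lower())

def match_schemas(a: list, b: list) -> list:
    """Pair attributes of interface a with attributes of interface b
    that normalize to the same ontology concept."""
    concept_b = {to_concept(x): x for x in b}
    return [(x, concept_b[c]) for x in a
            if (c := to_concept(x)) is not None and c in concept_b]

pairs = match_schemas(["Writer", "Book Title"], ["author", "title", "cost"])
# "Writer" pairs with "author", "Book Title" with "title"
```

Matching through concepts rather than raw strings is what lets "Writer" and "author" align even though they share no surface form, which is the advantage the ontology brings over purely syntactic schema matching.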
Through this concept mapping, the same domain knowledge drawn from multiple query interfaces is standardized and harmonized, achieving the matching of different query interfaces.

b) Schema merging: According to the results of schema matching, schema merging combines synonymous attributes from different query interfaces while maintaining the attribute order and domain-specific structural characteristics. Finally, a global integration interface is generated by sorting the concepts by matching frequency in descending order.

(5) Automatic form filling for Deep Web query interfaces: In the Deep Web, a significant amount of information can only be accessed by filling in the query interface of a back-end database. Each query interface is composed of a group of domain-related attributes; the user fills in his or her requirements on the integration interface, and all the underlying sources are filled in and searched automatically. To convert a query plan from the source form (the integration form) to a target form (a local interface), the following problems must be handled:

a) Predicate recognition: Synonymous predicates in the source and target query interfaces may be expressed in different formats, so the matching relationships between predicates of different query interfaces must be coordinated. This thesis exploits a "bridging" effect by introducing an ontology mapper, which bridges the relationships between the user query and the various databases.

b) Predicate mapping: Maps the queries that users fill into the integration interface according to the matching type, converting them so that they satisfy the query-syntax constraints of each local form. This thesis implements the mapping mechanism based on data types.
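As a sketch of type-driven predicate mapping, the example below translates a numeric range filled into the integration interface into the subset of a local form's predefined range options that covers it, i.e. the minimum coverage sought by the later query-rewriting step. The bucket boundaries are invented for illustration.

```python
# Hypothetical numeric-type processor: given a user range [lo, hi] from the
# integration form and a local form's fixed range options (buckets), pick
# every bucket that overlaps the user range so their union covers it.
def map_numeric_range(lo: float, hi: float, buckets: list) -> list:
    """Select the local form's (a, b) range options overlapping [lo, hi]."""
    return [(a, b) for (a, b) in buckets if b > lo and a < hi]

# A local interface might only offer these price brackets:
local_buckets = [(0, 10), (10, 20), (20, 50), (50, 100)]
chosen = map_numeric_range(15, 40, local_buckets)
# -> [(10, 20), (20, 50)]: the smallest set of options covering 15..40
```

Each data type (numeric range, enumerated value, free text) would get its own processor of this shape, which is what makes the mechanism easy to extend to new types.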
Predicate mapping based on data types provides a platform for semantic mapping and can easily be extended.

c) Query rewriting: Finds, in the target form, the query condition closest to the source query; that is, the target range is exactly a minimum coverage of the original one in the corresponding search space. Type-based query rewriting can be seen as a search problem in which queries are handled by different type processors, each implementing a type-driven search that translates queries into formats suitable for the local interfaces.

d) Query submission: The "action" parameter is the receiving path on the server to which query requests are submitted. After query rewriting, when the user triggers the "Submit" button, the web browser encodes the data filled into the source form, constructs a URL with parameters, and finally sends the encoded URL to the web server.

In summary, this thesis presents a detailed study of Deep Web data integration. The key technologies are not yet fully mature, and much of this work remains at the exploratory stage, leaving room for further improvement and innovation.
Keywords/Search Tags: Deep Web, ontology, schema extraction, schema matching, schema merge, query translation, focused crawling