
Studies On Schema Matching Algorithms In Database

Posted on: 2013-10-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G H Ding
Full Text: PDF
GTID: 1228330467982728
Subject: Computer system architecture
Abstract/Summary:
Schema matching is widely used in database applications such as data integration, dataspaces, the Deep Web, data warehouses, and ontology merging. It aims to find the correspondences (matches) between elements of the input schemas that have the same or similar semantics. Schema matching has been studied for decades, and the shift from early manual matching to modern automatic matching has produced many results. To some extent, schema matching is comparable to Natural Language Processing (NLP), because it is a process of discovering and understanding the semantics of schema elements based on existing knowledge. This suggests that schema matching is inherently difficult, and the topic still needs more attention from researchers. Recently, with the development of the Internet and the spread of communication equipment, the demand for data sharing and exchange has grown steadily, which has made schema matching an active research topic. Studies on schema matching therefore have both theoretical significance and practical value.

We extract valuable statistics about relational attributes from database query logs and propose algorithms for discovering and improving matches. We also study the application of matches in schema integration and propose algorithms that automatically generate multiple mediated schemas based on user preference. This work focuses only on relational schemas. Our contributions are as follows:

(1) Discovery of Matches. First, we exploit the occurrence frequency of attributes to find matches. The occurrence frequency of attributes in query clauses is used to construct feature vectors. We partition the vectors of different attributes into clusters; attributes in the same cluster have similar semantics.
The maximal similarity threshold is used to identify exceptional attributes, and an algorithm is designed to remove them. The experimental results show that our approach achieves high accuracy.

Second, we use the order in which attributes appear in the schema structure of query results to find matches. Our approach works in three phases. In the first phase, the appearance sequence is extracted from the query log, and statistics about the appearance order of attributes in the sequence are collected. In the second phase, matrices are used to organize the statistics. In the third phase, we employ two scoring functions to measure the similarity between the statistics matrices of the input schemas, and a simulated annealing algorithm is used to find the optimal mapping. The experimental results show that our approach returns accurate results.

Finally, we use statistics about the SQL statements in query logs to perform schema matching. Our approach can be divided into four phases. The first phase collects statistics about the clauses in the SQL statements and constructs the clause association graph (CAG). Then, we generate a set of map pairs, each of which represents a pair of attribute sequences. Third, two algorithms are designed to decompose the map pairs into single matches, and a threshold is used to choose the optimal mapping. Extensive experiments show that our approach is effective and accurate.

(2) Improvement of Matches. We propose an algorithm that improves schema matching in the case where source instances include implicit categories. We detect the implicit categorical semantics in the source instances and associate them with the matches to improve match quality. Our approach works in three phases. First, we detect the possible categories of source instances.
Second, entropy is used to remove interfering attributes and obtain the real categorical attributes. Finally, we use a new concept called c-mapping to associate the categorical semantics with the matches. The experimental results show that our approach performs well.

(3) Applications of Matches. The final goal of schema matching is to solve practical problems. We therefore study the application of schema matching in schema integration and propose an approach that generates multiple mediated schemas based on user preference. An introduced reference schema guides the integration so that the mediated schemas are generated according to user preference. We use attribute density and F-Measure to measure the similarity between candidate schemas and the standard schemas. Based on this similarity, we design a top-k algorithm that generates the k mediated schemas the user actually requires. The experimental results show that our approach performs well.
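
To make the top-k selection step more concrete, the following is a minimal sketch and not the dissertation's algorithm. It assumes that a schema can be modeled as a plain set of attribute names and that candidates are ranked by F-Measure alone; attribute density is left out because the abstract does not define it. The names f_measure and top_k_schemas and the sample data are hypothetical.

```python
import heapq


def f_measure(candidate, reference):
    """F-Measure between two schemas, each modeled as a set of attribute names.

    Assumption: precision and recall are computed on attribute overlap.
    """
    candidate, reference = set(candidate), set(reference)
    overlap = len(candidate & reference)
    if overlap == 0:
        # Also covers empty schemas, so no division by zero below.
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def top_k_schemas(candidates, reference, k):
    """Return the k candidate schemas most similar to the reference, best first."""
    return heapq.nlargest(k, candidates, key=lambda c: f_measure(c, reference))


# Hypothetical reference schema expressing the user preference.
reference = {"name", "phone", "city"}

# Hypothetical candidate mediated schemas.
candidates = [
    {"name", "phone", "city", "fax"},
    {"name", "email"},
    {"name", "phone"},
    {"zip"},
]

best_two = top_k_schemas(candidates, reference, 2)
```

Using heapq.nlargest avoids sorting every candidate when k is small compared with the number of candidates. A fuller version would presumably combine the score with attribute density, as the abstract describes.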
Keywords/Search Tags: schema matching, query log, occurring frequency, cluster technique, mediated schema