
Research On Schema Matching Technology Supporting Massive Heterogeneous Data Integration

Posted on: 2014-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: C L Zhao
GTID: 2268330425491804
Subject: Computer technology
Abstract/Summary:
With the rapid development of network and information technologies, numerous large, heterogeneous datasets have emerged. Data integration is the usual way to make use of such data, and schema matching is its core technology. However, these datasets are typically highly heterogeneous and may suffer from semantic opacity, duplicate records, missing values, or a lack of schema information, which makes traditional schema matching techniques inapplicable. In response, this thesis studies schema matching both when schema information is complete and when it is unknown, and proposes a schema matching method for each case.

For data with complete schema information, this thesis puts forward a schema-oriented matching method based on semantic information and functional dependencies. The method starts from an analysis of the semantic and structural information of schema elements. First, it calculates semantic similarity with the help of WordNet as a primary screening condition and obtains candidate match sets for the elements to be matched. Then it uses a functional dependency graph to describe the structural information of each schema precisely, and considers intergenerational dependency relationships to mine deep structural information and compute the structural similarity between schemas. Finally, by analyzing semantic and structural similarity, it generates a probability factor dynamically and adaptively to adjust the preliminary results, and ultimately screens out a comprehensive and rational mapping between attribute elements, realizing flexible and efficient schema matching when schema information is complete.

When schema information is missing or incomplete, whether because design documents are missing or invalid, the database has evolved, access is restricted, or for other practical reasons, this thesis proposes a data-driven schema matching method based on information theory. The method relies entirely on the common properties and distribution characteristics of the data itself and does not assume any external knowledge. First, to compute similarities between attribute columns, it defines an information-theoretic model, drawing on existing information theory, that represents the data distribution characteristics of attribute columns and the relations between them at a finer granularity. Next, it presents an algorithm for constructing an original data distribution graph that formally describes the relationships between attributes. It then derives an evolved data distribution graph by analyzing and transforming the original graph. This graph makes it possible to cluster the original data more accurately, detect attribute columns that are likely to match, and finally achieve schema matching.

Finally, the thesis reports extensive experiments on real and simulated datasets, and the experimental results verify the feasibility and effectiveness of the proposed methods. The two methods apply to schema matching with complete schema information and without schema information, respectively; together they address the schema matching problem comprehensively and precisely and meet the requirements of practical applications.
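To make the data-driven idea above more concrete, the following is a minimal sketch of how attribute columns can be compared purely by their data distributions. It assumes Shannon entropy and mutual information as the measures; the abstract does not state the thesis's exact information-theoretic model, so the function names (`entropy`, `mutual_information`, `match_columns`) and the greedy closest-entropy pairing are illustrative assumptions, not the author's algorithm.

```python
from collections import Counter
from math import log2


def entropy(column):
    """Shannon entropy (in bits) of the value distribution of one column."""
    n = len(column)
    counts = Counter(column)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two row-aligned columns,
    computed as H(A) + H(B) - H(A, B)."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of rows")
    joint = list(zip(col_a, col_b))
    return entropy(col_a) + entropy(col_b) - entropy(joint)


def match_columns(source, target):
    """Hypothetical matcher: pair each source column with the target column
    whose entropy is closest. Column names are never inspected, so this
    still works when names are opaque or missing."""
    target_entropy = {name: entropy(values) for name, values in target.items()}
    matches = {}
    for name, values in source.items():
        h = entropy(values)
        matches[name] = min(target_entropy,
                            key=lambda t: abs(target_entropy[t] - h))
    return matches


# Illustrative data only: two tables holding the same kind of information,
# with meaningless column names on the target side.
source = {"city": ["Paris", "Rome", "Oslo", "Bern"],
          "gender": ["M", "F", "M", "F"]}
target = {"c1": ["x", "y", "y", "x"],
          "c2": ["p", "q", "r", "s"]}

print(match_columns(source, target))  # {'city': 'c2', 'gender': 'c1'}
```

Entropy alone is a coarse signature, since unrelated columns can share the same entropy. Mutual information between columns within each table gives the pairwise relations that a graph over attributes, such as the data distribution graph mentioned above, could be built from.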
Keywords/Search Tags:schema matching, schema-oriented, functional dependency, data-driven, information theory model