Research On Mapping Of Heterogeneous Data Integration

Posted on: 2009-11-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J J Miao
GTID: 1118360278456590
Subject: Computer Science and Technology
Abstract/Summary:
Data integration is the foundation of information integration technology. As the use of information keeps growing, large-scale heterogeneous data integration has become a hot topic in information research. Mapping technology is the key to establishing consistency among heterogeneous data, including the consistency of data models and the consistency of data instances. This dissertation studies mapping and matching technologies for maintaining consistency among heterogeneous data. By introducing machine learning, natural language processing, and fuzzy model theory, we improve the schema mapping approach and the instance matching approach, and we optimize the broken mapping detection algorithm. In practice, we extend the heterogeneous data integration platform (StarEAI) and verify our approaches on widely used real-world applications. This dissertation makes four contributions.
First, to address consistency at the schema level, we propose an Instance-based Multi-Strategy Schema Matching Approach (MSMA). Schema mapping uses schema information and other descriptions, along with the characteristics of instances, to identify the relations between different schemas. Both rule-based and machine-learning-based approaches tackle this problem. Examining the existing mapping approaches, we conclude that they build the decision model either automatically or manually, and that the machine-learning-based approach is the more adaptable. A single learner decides whether a relationship holds from one type of available information, whereas the multi-strategy approach considers a variety of information. Consequently, the multi-strategy approach makes fuller use of the available information and can thus improve mapping accuracy.
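
To make the contrast between a single learner and a multi-strategy combination concrete, here is a minimal Python sketch. It is not the MSMA implementation: the two base learners, the target attribute names, the typical lengths, and the weights are all hypothetical, and a plain weighted average is assumed as the combination rule.

```python
import re
from typing import Callable, Dict, List

Scores = Dict[str, float]
Learner = Callable[[List[str]], Scores]

# Hypothetical target attributes of a mediated schema (illustration only).
TARGETS = ["phone", "email", "name"]

def pattern_learner(values: List[str]) -> Scores:
    """Score each target by the fraction of instance values matching a pattern."""
    patterns = {
        "phone": re.compile(r"^[\d\-\s()+]{7,}$"),
        "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
        "name": re.compile(r"^[A-Za-z]+(?: [A-Za-z]+)+$"),
    }
    return {t: sum(bool(p.match(v)) for v in values) / len(values)
            for t, p in patterns.items()}

def length_learner(values: List[str]) -> Scores:
    """Score each target by how close the mean value length is to an assumed typical length."""
    typical = {"phone": 12.0, "email": 20.0, "name": 10.0}  # assumed values
    mean_len = sum(len(v) for v in values) / len(values)
    return {t: 1.0 / (1.0 + abs(mean_len - expected))
            for t, expected in typical.items()}

def combine(values: List[str], learners: List[Learner],
            weights: List[float]) -> Scores:
    """Weighted average of the base learners' scores (assumed combination rule)."""
    total = sum(weights)
    combined = {t: 0.0 for t in TARGETS}
    for learner, weight in zip(learners, weights):
        scores = learner(values)
        for t in TARGETS:
            combined[t] += weight * scores[t] / total
    return combined

def best_match(values: List[str], learners: List[Learner],
               weights: List[float]) -> str:
    """Return the target attribute with the highest combined score."""
    combined = combine(values, learners, weights)
    return max(combined, key=combined.get)

if __name__ == "__main__":
    column = ["555-123-4567", "555-987-6543", "(555) 222-3333"]
    print(best_match(column, [pattern_learner, length_learner], [0.7, 0.3]))
```

In this sketch each learner sees only one kind of evidence (value format or value length), and the combiner decides from both, which is the sense in which a multi-strategy approach uses more of the available information than any single learner.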
MSMA designs a number of learners to capture instance information and improves the multi-strategy approach. The experimental results show that the precision of MSMA reaches up to 89% and its recall up to 93%. When schema information is lacking, MSMA is more precise than the original approach.
Second, to address consistency at the instance level, we propose a Holistic Data Instance Matching Approach (HIMA). Heterogeneous instances are different descriptions of the same entity in different data sources, and instance matching can eliminate this heterogeneity. We first measure the similarity of instances with string distance algorithms; a conditional-probability-based algorithm improves the accuracy of the whole approach. In terms of framework, traditional methods take only two input data sources and perform pairwise matching. HIMA instead uses a clustering algorithm, which lets it handle data sources at large scale holistically. In addition, we use a keyword extraction method based on the maximum entropy model to remove useless information. The experimental results show that the keyword extraction algorithm achieves 70% precision and that the conditional-probability-based algorithm is more precise than the token-based algorithm. HIMA achieves 83% accuracy.
Third, to detect broken mappings at run time, we put forward a Fuzzy-based Broken Schema Mapping Detecting Approach (BSMD). In a dynamic distributed environment, data sources tend to undergo changes that invalidate the mappings. Continuously monitoring for such changes is extremely labor intensive and poses a key bottleneck to the widespread deployment of data integration systems. The kernel of BSMD is a set of computationally inexpensive modules called sensors, which capture salient characteristics of data sources, as in the Maveric system.
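
The holistic matching idea described for HIMA above can be illustrated with a small Python sketch. The abstract does not say which string distance or clustering algorithm HIMA uses, so this sketch assumes `difflib.SequenceMatcher` as the similarity measure and a simple single-link clustering over a union-find structure; the record format, the function names, and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (source name, textual description); assumed format

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_records(records: List[Record],
                    threshold: float = 0.8) -> List[List[Record]]:
    """Group records from any number of sources by single-link clustering."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair whose similarity passes the threshold.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i][1], records[j][1]) >= threshold:
            parent[find(i)] = find(j)

    clusters: Dict[int, List[Record]] = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())

if __name__ == "__main__":
    records = [
        ("source_a", "International Business Machines"),
        ("source_b", "international business machines"),
        ("source_a", "Acme Corp"),
        ("source_c", "ACME Corp."),
        ("source_b", "Globex"),
    ]
    for cluster in cluster_records(records):
        print(cluster)
```

Because all records go into one pool before clustering, adding a third or fourth source needs no extra pairwise matching runs between sources, which is the practical difference from a two-input pairwise matcher.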
We develop two novel improvements: Disjunction-Weighted Average Operators are used to calculate the score that indicates whether a mapping is broken, and Change Weight Operators are introduced to combine artificial data with real data in the training phase. Experiments over real-world sources demonstrate the effectiveness of our fuzzy-based approach over existing solutions, as well as the utility of our improvements.
Finally, based on the studies above, we extend the heterogeneous data integration platform (StarEAI), which is the outcome of an 863 project. We extend this platform with three modules: the automatic schema mapping module, the instance matching module, and the broken mapping detecting module. The StarEAI+ system has been successfully deployed in armed forces and network monitoring projects.
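
As a rough illustration of the sensor-based detection described for BSMD above, the following Python sketch profiles a mapped attribute with two inexpensive sensors and combines their outputs into one score. The abstract does not define the Disjunction-Weighted Average Operators, so a plain weighted average stands in for them here; the sensors, weights, and threshold are hypothetical.

```python
from statistics import mean
from typing import Callable, List

Sensor = Callable[[List[str], List[str]], float]

def length_sensor(train_values: List[str], new_values: List[str]) -> float:
    """Health in [0, 1]: 1.0 when the mean value length is unchanged."""
    t = mean(len(v) for v in train_values)
    n = mean(len(v) for v in new_values)
    return min(t, n) / max(t, n) if max(t, n) > 0 else 1.0

def digit_ratio_sensor(train_values: List[str], new_values: List[str]) -> float:
    """Health in [0, 1]: 1.0 when the share of digit characters is unchanged."""
    def ratio(values: List[str]) -> float:
        chars = "".join(values)
        return sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0
    return 1.0 - abs(ratio(train_values) - ratio(new_values))

def mapping_health(train_values: List[str], new_values: List[str],
                   sensors: List[Sensor], weights: List[float]) -> float:
    """Plain weighted average of sensor scores (assumed stand-in operator)."""
    total = sum(weights)
    return sum(w * s(train_values, new_values)
               for s, w in zip(sensors, weights)) / total

def is_broken(train_values: List[str], new_values: List[str],
              sensors: List[Sensor], weights: List[float],
              threshold: float = 0.7) -> bool:
    """Flag the mapping as broken when the combined health falls below the threshold."""
    return mapping_health(train_values, new_values, sensors, weights) < threshold

if __name__ == "__main__":
    sensors = [length_sensor, digit_ratio_sensor]
    weights = [0.5, 0.5]
    prices = ["19.99", "5.49", "120.00"]             # values seen when the mapping was valid
    still_prices = ["21.50", "7.25", "99.95"]        # source unchanged
    now_text = ["In stock - ships tomorrow", "Out of stock",
                "In stock - ships today"]            # source layout changed
    print(is_broken(prices, still_prices, sensors, weights))
    print(is_broken(prices, now_text, sensors, weights))
```

Each sensor is cheap to evaluate, so a check like this can run on every refresh of a source instead of relying on manual monitoring.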
Keywords/Search Tags: data integration, heterogeneous data integration, schema mapping, instance matching, broken mapping detecting, multi-strategy learning