Design And Implementation Of An Uncertain Data Integration Tool

Posted on:2017-04-08

Degree:Master

Type:Thesis

Country:China

Candidate:K Huang

Full Text:PDF

GTID:2348330503489893

Subject:Computer software and theory

Abstract/Summary:

With the development of information technology, uncertain data has attracted considerable attention from both academia and industry, and many applications now store large amounts of it. Because the applications of different enterprises and departments are independent of one another, uncertain data cannot be exchanged or shared, which creates 'information islands'. Integrating uncertain data is therefore important. Uncertain data carries a probability that represents the reliability of the data. Data integration requires schema matching and the elimination of duplicate records, and data uncertainty makes both tasks harder: unlike traditional data integration, uncertain data integration must also handle probabilities. Based on a requirements analysis, a tool for uncertain data integration is designed and implemented. It consists of a schema matching module and a similar record processing module. The schema matching module uses an instance partition method. The data is divided into string and numeric types. String data is partitioned using the minimum average edit distance, and numeric data is partitioned using the adjacent average distance. After partitioning, information entropy is used to compute the similarity of attributes. The average similarity over all attributes represents the schema similarity, a probability in the interval from 0 to 1 that the user consults to decide whether two schemas match. The similar record processing module uses concurrent multithreading for similarity detection: one thread performs detection while the other performs clustering. In addition, a modified similar record detection method for uncertain data is proposed. The method improves on the max probability method: it sorts the tuples in the c-table by their probability field and compares them in descending order of probability. If the similarity of two tuples exceeds a threshold, the two tuples are added to a cluster. Detection terminates when the entire c-table has been processed.
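The detection pass described above (sort by probability, compare in descending order, cluster pairs whose similarity exceeds a threshold) might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the thesis implementation: the names `record_similarity` and `detect_similar_records`, the representation of a c-table row as a `(values, probability)` pair, the default threshold of 0.8, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices, since the abstract does not specify them.

```python
from difflib import SequenceMatcher


def record_similarity(a, b):
    """Mean per-field string similarity in [0, 1].

    Assumption: a stand-in measure; the abstract does not say which
    record similarity function the tool uses.
    """
    scores = [SequenceMatcher(None, str(x), str(y)).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores) if scores else 0.0


def detect_similar_records(rows, threshold=0.8):
    """Cluster similar rows of a c-table.

    rows: list of (values_tuple, probability) pairs (hypothetical layout).
    Returns a list of clusters, each a list of rows.
    """
    # Sort by the probability field, largest first.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    assigned = [False] * len(ordered)
    clusters = []
    for i, (values, _prob) in enumerate(ordered):
        if assigned[i]:
            continue
        # The most probable unassigned tuple seeds a new cluster.
        assigned[i] = True
        cluster = [ordered[i]]
        # Compare it with every remaining, less probable tuple.
        for j in range(i + 1, len(ordered)):
            if assigned[j]:
                continue
            if record_similarity(values, ordered[j][0]) > threshold:
                assigned[j] = True
                cluster.append(ordered[j])
        clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    table = [
        (("John Smith", "Boston"), 0.9),
        (("Jon Smith", "Boston"), 0.6),
        (("Alice Wong", "Denver"), 0.7),
    ]
    for cluster in detect_similar_records(table):
        print(cluster)
```

In this sketch the highest-probability tuple of each cluster acts as its representative, which is one plausible reading of comparing tuples "from big to small"; the thesis may compare tuples differently.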
Similar record combination merges multiple c-tables into one c-table. If two tuples contain equal data, their probabilities are combined using the Dempster-Shafer combination theory. An experiment is designed to test the accuracy of schema matching and similar record detection, with recall and precision used as the accuracy measures. For schema matching, the results indicate that the instance partition method is more accurate than existing methods based on metadata, duplicate records, and clustering. For similar record detection, the experiment compares the modified method against the max probability method, and the results indicate that the modified max probability method is more accurate.
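As an illustration of the probability combination step, the sketch below implements Dempster's rule of combination for two mass functions in Python. How the thesis maps a tuple probability to a mass function is not stated in the abstract; the `simple_support` helper, which puts mass p on "present" and the remaining 1 - p on the whole frame, is an assumption made here for the example, as are all function and constant names.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function is a dict mapping frozenset (a subset of the
    frame of discernment) to its mass; masses should sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Mass assigned to contradictory evidence.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Normalize by the non-conflicting mass.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


PRESENT = frozenset({"present"})
FRAME = frozenset({"present", "absent"})


def simple_support(p):
    """Assumed mapping: mass p on 'present', the rest on the whole frame."""
    return {PRESENT: p, FRAME: 1.0 - p}


def combine_tuple_probabilities(p1, p2):
    """Combined belief that a tuple seen in two c-tables is present."""
    merged = dempster_combine(simple_support(p1), simple_support(p2))
    return merged.get(PRESENT, 0.0)


if __name__ == "__main__":
    print(combine_tuple_probabilities(0.6, 0.5))
```

Under this assumed mapping the two sources never conflict, so the combined value reduces to 1 - (1 - p1)(1 - p2). A different mass assignment, for example one that also puts mass on "absent", would produce conflict and a different result.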

Keywords/Search Tags:

Data Integration, Uncertain Data, Schema Matching, Duplicate Detection, Data Combination

Related items

1	Research On Key Issues In Deep Web Data Integration
2	Research On Schema Matching Technology Supporting Massive Heterogeneous Data Integration
3	Research On Duplicate Detection And Cleaning Of Uncertain Data
4	Research On Key Technologies Of Equipment Support Heterogeneous Data Integration And Design Of Integration Environment
5	Research And Application On Technology Of Deep Web Schema Acquisition
6	Design And Implementation On The Quality Assessment System Of Uncertain Data
7	Research On Technology Of Schema Matching Between Global Schema And Local Schema
8	Research On Global Schema Construction In Web Data Integration
9	Interactive Data Integration Methods Based On Internet And Crowdsourcing
10	Research On Capturing Both Types And Constraints In Data Integration