Design And Implementation Of An Uncertain Data Integration Tool

Posted on: 2017-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: K Huang
GTID: 2348330503489893
Subject: Computer software and theory
Abstract/Summary:
With the development of information technology, uncertain data has attracted considerable attention from both academia and industry, and many applications now store large amounts of it. Because the applications of different enterprises and departments are independent of one another, their uncertain data cannot be exchanged or shared, and 'information islands' form. Integrating uncertain data is therefore important. Uncertain data carries a probability that represents the reliability of the data. Data integration requires schema matching and the elimination of duplicate records, and data uncertainty makes both harder: unlike traditional data integration, uncertain data integration must also handle probabilities. Based on a requirements analysis, a tool for uncertain data integration is designed and implemented. It consists of a schema matching module and a similar-record processing module. The schema matching module uses an instance partition method. Data is divided into string and numeric types. String data is partitioned by minimum average edit distance, and numeric data by adjacent average distance. After partitioning, information entropy is used to compute the similarity of attributes. The average similarity over all attributes represents the schema similarity, a probability in the interval from 0 to 1 that the user relies on to decide whether two schemas match. The similar-record processing module uses multiple concurrent threads for similarity detection: one thread performs detection while another performs clustering. A modified similar-record detection method for uncertain data is also proposed. It improves on the max probability method: it sorts the tuples in the c-table by their probability field and compares them in descending order of probability. If the similarity of two tuples exceeds a threshold, both tuples are added to a cluster. Detection terminates once all c-tables have been examined.
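The descending-probability detection step can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not name the record similarity measure, so `difflib.SequenceMatcher` stands in for it, and the function names, the `(attributes, probability)` row layout, the choice to compare each tuple against the highest-probability member of a cluster, and the default threshold of 0.8 are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Mean string similarity over aligned attribute values.

    Assumption: a stand-in measure; the abstract does not specify one.
    """
    scores = [SequenceMatcher(None, str(x), str(y)).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def detect_similar_records(c_table, threshold=0.8):
    """Cluster similar tuples, visiting them from highest to lowest probability.

    c_table: list of (attributes, probability) pairs (hypothetical layout).
    Returns a list of clusters; each cluster is a list of such pairs whose
    first element is its highest-probability member.
    """
    # Sort by the probability field, largest first.
    ordered = sorted(c_table, key=lambda row: row[1], reverse=True)
    clusters = []
    for attrs, prob in ordered:
        for cluster in clusters:
            representative_attrs, _ = cluster[0]
            if similarity(attrs, representative_attrs) > threshold:
                cluster.append((attrs, prob))
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([(attrs, prob)])
    return clusters
```

Because tuples are visited in descending probability order, the first tuple in each cluster is the most reliable one, which makes it a natural representative for later comparisons.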
Similar-record combination merges multiple c-tables into one c-table. If two tuples hold equal data, their probabilities are combined using the Dempster-Shafer theory of evidence combination. An experiment is designed to test the accuracy of schema matching and similar-record detection, with recall and precision as the accuracy measures. For schema matching, the results indicate that the instance partition method is more accurate than existing methods based on metadata, duplicate records, and clustering. For similar-record detection, the experiment compares the proposed method against the max probability method, and the results indicate that the modified max probability method is more accurate.
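To illustrate how Dempster's rule of combination could merge the probabilities of two equal tuples, here is a minimal sketch. The abstract does not say how a tuple's probability is turned into a mass function, so the mapping below is an assumption: mass `p` goes to "present" and the remaining `1 - p` to the whole frame, that is, to ignorance. All names are hypothetical.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + x * y
        else:
            # Mass that falls on the empty set is conflict between the sources.
            conflict += x * y
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


PRESENT = frozenset({"present"})
FRAME = frozenset({"present", "absent"})


def tuple_mass(p):
    """Assumed mapping: p supports 'present', the rest is left uncommitted."""
    return {PRESENT: p, FRAME: 1.0 - p}


def combine_tuple_probabilities(p1, p2):
    """Combined belief that a tuple reported by two sources is present."""
    return dempster_combine(tuple_mass(p1), tuple_mass(p2))[PRESENT]
```

Under this mapping the two sources never conflict, and the result reduces to `1 - (1 - p1) * (1 - p2)`, so agreement between sources raises the combined probability above either input.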
Keywords/Search Tags: Data Integration, Uncertain Data, Schema Matching, Duplicate Detection, Data Combination