
Approximately Duplicate Record Detection Method in Uncertain Databases

Posted on: 2015-02-28  Degree: Master  Type: Thesis
Country: China  Candidate: Y Yu  Full Text: PDF
GTID: 2208330431469167  Subject: Computer technology
Abstract/Summary:
With the continuous development of information technology and the deepening of informatization, a great deal of uncertainty exists in data across economics, telecommunications, biology, Web applications, and many other fields. Uncertain data contains "dirty data" such as approximately duplicate records, missing values, and erroneous values, which degrades the quality of uncertain databases to a certain extent. Data cleaning technology can manage the "dirty data" generated in uncertain data, thereby improving data quality and laying the foundation for correct decisions. Although many approximately duplicate records exist in uncertain databases, duplicate record detection in uncertain databases has not yet been studied. Therefore, this thesis introduces a method that focuses primarily on character-type data and detects duplicate records in uncertain databases without dependencies.

In uncertain databases, uncertainty in the data can be divided into two levels: the attribute level and the tuple level. Following this division, this thesis divides the work of duplicate record detection into two steps: attribute similarity calculation and record similarity calculation.

1. To calculate attribute similarity, this thesis introduces a definition of attribute-level weights based on an existing distance-function method. Then, combining this with attribute-level uncertainty in uncertain databases, it presents a distance-function matching algorithm with probability (PMDU, Probabilistic Matching-Distance in Uncertain data).

2. To calculate record similarity, this thesis introduces a definition of tuple-level weights based on the superposition of uncertain factors. Each calculated attribute similarity value is consolidated, taking into account the uncertainty of records in the uncertain database. Finally, this thesis presents a contribution-superposition algorithm with probability (PCSC, Probabilistic Contribution Superposition Calculator).

3. To judge whether two records are duplicates, this thesis compares the record similarity with a preset threshold; based on the comparison result, the two records are determined to be duplicates or not.

4. We generated an uncertain database and used its records to implement and test the effectiveness of the proposed PMDU and PCSC methods, so as to verify the detection of duplicate records. Experimental results show that the proposed methods are usable, accurate, and efficient.
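The abstract does not give the PMDU formula itself; the following is a minimal sketch of what an attribute-level similarity of this kind might look like, assuming edit distance as the underlying distance function and an existence probability attached to each attribute value (the function names and the probability-weighting scheme are illustrative assumptions, not the thesis's actual definitions):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein edit distance via dynamic programming (rolling row)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))          # distances for the previous row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (a[i - 1] != b[j - 1]))    # substitution
            prev = cur
    return dp[n]

def pmdu_similarity(a: str, p_a: float, b: str, p_b: float) -> float:
    """Hypothetical PMDU-style attribute similarity in [0, 1]:
    edit-distance similarity scaled by the two values' existence probabilities."""
    if not a and not b:
        return p_a * p_b
    sim = 1.0 - edit_distance(a, b) / max(len(a), len(b))
    return p_a * p_b * sim
```

For example, comparing two certain values (`p = 1.0`) reduces to plain normalized edit-distance similarity, while lower existence probabilities proportionally reduce the score.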
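Likewise, the record-level step (PCSC) and the threshold decision can be sketched as a weighted superposition of attribute similarities scaled by tuple-level probabilities; again, the weighting and the default threshold here are illustrative assumptions rather than the thesis's exact algorithm:

```python
def pcsc_record_similarity(attr_sims, weights, p_r1=1.0, p_r2=1.0):
    """Hypothetical PCSC-style record similarity: weighted average of
    attribute similarities, scaled by the two tuples' probabilities."""
    total = sum(weights)
    sim = sum(w * s for w, s in zip(weights, attr_sims)) / total
    return p_r1 * p_r2 * sim

def is_duplicate(attr_sims, weights, threshold=0.8, p_r1=1.0, p_r2=1.0):
    """Two records are judged duplicates when their similarity
    reaches a preset threshold."""
    return pcsc_record_similarity(attr_sims, weights, p_r1, p_r2) >= threshold
```

With equal weights and certain tuples, three attribute similarities of 1.0, 0.9, and 1.0 average to about 0.97, which exceeds a 0.8 threshold and yields a duplicate verdict.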
Keywords/Search Tags: Uncertain data, Data cleaning, Approximately duplicate records, Edit distance, Contribution superposition