
Approximately Duplicate Record Detection Method in Uncertain Databases

Posted on: 2015-02-28  Degree: Master  Type: Thesis
Country: China  Candidate: Y Yu  Full Text: PDF
GTID: 2208330431469167  Subject: Computer technology
Abstract/Summary:
With the continuous development of information technology and the deepening of informatization, a great deal of uncertainty exists in data across economics, telecommunications, biology, Web applications, and many other fields. Uncertain data contains "dirty data" such as approximately duplicate records, missing values, and erroneous values, which degrades the quality of uncertain databases to a certain extent. Data cleaning technology can manage the "dirty data" generated in uncertain data, thereby improving data quality and laying the foundation for correct decisions. Although many approximately duplicate records exist in uncertain databases, duplicate record detection in uncertain databases has not yet been studied. Therefore, this thesis introduces a method that focuses primarily on character-type data and detects duplicate records in uncertain databases without dependencies.

In uncertain databases, uncertainty in the data can be divided into two levels: the attribute level and the tuple level. Following this division, this thesis divides the work of duplicate record detection into two steps: attribute similarity calculation and record similarity calculation.

1. To calculate attribute similarity, this thesis introduces a definition of attribute-level weights based on an existing distance-function method. Then, combining this with attribute-level uncertainty in uncertain databases, it presents a distance-function matching algorithm with probability (PMDU, Probabilistic Matching-Distance in Uncertain data).

2. To calculate record similarity, this thesis introduces a definition of tuple-level weights based on the superposition of uncertain factors. Each calculated attribute similarity value is consolidated, taking into account the uncertainty of records in the uncertain database. Finally, this thesis presents a contribution-superposition algorithm with probability (PCSC, Probabilistic Contribution Superposition Calculator).

3. To judge whether two records are duplicates, this thesis compares the record similarity with a preset threshold; based on the comparison result, the two records are determined to be duplicates or not.

4. We generated an uncertain database and used its records to implement and test the effectiveness of the proposed PMDU and PCSC methods, so as to verify the detection of duplicate records. Experimental results show that the proposed methods are usable, accurate, and efficient.
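The abstract does not give the PMDU formula itself; the following is a minimal sketch of what an attribute-level similarity of this kind might look like, assuming edit distance as the underlying distance function and an existence probability attached to each attribute value (the function names and the probability-weighting scheme are illustrative assumptions, not the thesis's actual definitions):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein edit distance via dynamic programming (rolling row)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))          # distances for the previous row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (a[i - 1] != b[j - 1]))    # substitution
            prev = cur
    return dp[n]

def pmdu_similarity(a: str, p_a: float, b: str, p_b: float) -> float:
    """Hypothetical PMDU-style attribute similarity in [0, 1]:
    edit-distance similarity scaled by the two values' existence probabilities."""
    if not a and not b:
        return p_a * p_b
    sim = 1.0 - edit_distance(a, b) / max(len(a), len(b))
    return p_a * p_b * sim
```

For example, comparing two certain values (`p = 1.0`) reduces to plain normalized edit-distance similarity, while lower existence probabilities proportionally reduce the score.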
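Likewise, the record-level step (PCSC) and the threshold decision can be sketched as a weighted superposition of attribute similarities scaled by tuple-level probabilities; again, the weighting and the default threshold here are illustrative assumptions rather than the thesis's exact algorithm:

```python
def pcsc_record_similarity(attr_sims, weights, p_r1=1.0, p_r2=1.0):
    """Hypothetical PCSC-style record similarity: weighted average of
    attribute similarities, scaled by the two tuples' probabilities."""
    total = sum(weights)
    sim = sum(w * s for w, s in zip(weights, attr_sims)) / total
    return p_r1 * p_r2 * sim

def is_duplicate(attr_sims, weights, threshold=0.8, p_r1=1.0, p_r2=1.0):
    """Two records are judged duplicates when their similarity
    reaches a preset threshold."""
    return pcsc_record_similarity(attr_sims, weights, p_r1, p_r2) >= threshold
```

With equal weights and certain tuples, three attribute similarities of 1.0, 0.9, and 1.0 average to about 0.97, which exceeds a 0.8 threshold and yields a duplicate verdict.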
Keywords/Search Tags: Uncertain data, Data cleaning, Approximately duplicate records, Edit distance, Contribution superposition