Research On Data Source Selection Algorithm For Inconsistency Detection

Posted on: 2020-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Hui
Full Text: PDF
GTID: 2438330575960092
Subject: Software engineering
Abstract/Summary:
Data quality is commonly evaluated along several dimensions: consistency, integrity, accuracy, and redundancy. In many fields, such as commerce, music, and sports, a large number of data sources provide poor-quality data. Such data inconveniences users in many ways (for example through redundancy, inconsistency, and incompleteness) and reduces the efficiency of data use. A fast and effective method for detecting data errors is therefore needed.

Consistency is one of the core dimensions of data quality. Data is inconsistent when the same attribute of the same entity carries different values, and inconsistency degrades data quality. When data sources that refer to the same entity contain erroneous or contradictory data, selecting data sources becomes harder and the sources become less reliable. Currently, consistency detection mainly checks whether data violates dependency rules, such as functional dependencies and conditional functional dependencies. Rules alone are not enough, however, because a data set that fully satisfies the set of dependency rules may still contain errors. To find more errors in the target data set, we use multiple data sources together with the dependency rule set to detect inconsistencies in the target data set. Because the number of data sources is huge, accessing all of them incurs a high storage cost and makes inconsistency detection too expensive. To solve this problem, we select k data sources from the data source set so as to maximize the number of inconsistencies detected in the target data set. We call this the problem of data source selection for inconsistency detection.

Common dependency rules include functional dependencies and matching dependencies. Chapter 3 proposes the multi-source selection problem based on a set of functional dependency rules. An effective signature is designed with Bloom filter technology, so that data sources can be selected from their signatures alone, without accessing the sources themselves. Chapter 4 proposes the multi-source selection problem based on a set of matching dependency rules. A first-level signature is designed with min-hash technology, and a second-level signature is then built on it with Bloom filter technology, so that data sources can again be selected from their signatures. Both theoretical analysis and experimental results demonstrate the correctness and effectiveness of the proposed methods.
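The idea of selecting sources from Bloom filter signatures can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the signature layout (one filter for left-hand-side values, one for left/right value pairs), the greedy max-coverage selection, and all names (`BloomFilter`, `build_signature`, `estimated_conflicts`, `select_sources`) are assumptions made for illustration, and it handles only a single functional dependency lhs -> rhs.

```python
import hashlib
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]
SEP = "\x1f"  # separator unlikely to appear in attribute values


class BloomFilter:
    """Small Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: str) -> List[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


Signature = Tuple[BloomFilter, BloomFilter]


def build_signature(rows: List[Row], lhs: str, rhs: str) -> Signature:
    """Hypothetical signature: one filter of lhs values, one of (lhs, rhs) pairs."""
    keys, pairs = BloomFilter(), BloomFilter()
    for row in rows:
        keys.add(row[lhs])
        pairs.add(row[lhs] + SEP + row[rhs])
    return keys, pairs


def estimated_conflicts(target: List[Row], sig: Signature, lhs: str, rhs: str) -> Set[int]:
    """Indices of target rows whose lhs value seems present in the source
    while the exact (lhs, rhs) pair is absent, i.e. a likely FD violation.
    Bloom filter false positives make this an estimate, not an exact set."""
    keys, pairs = sig
    return {
        i for i, row in enumerate(target)
        if row[lhs] in keys and (row[lhs] + SEP + row[rhs]) not in pairs
    }


def select_sources(target: List[Row], sigs: Dict[str, Signature],
                   lhs: str, rhs: str, k: int) -> List[str]:
    """Greedy max-coverage (an assumed strategy): repeatedly pick the source
    that flags the most target rows not yet flagged by chosen sources."""
    gains = {name: estimated_conflicts(target, sig, lhs, rhs) for name, sig in sigs.items()}
    chosen: List[str] = []
    covered: Set[int] = set()
    for _ in range(k):
        best = max((n for n in gains if n not in chosen),
                   key=lambda n: len(gains[n] - covered), default=None)
        if best is None or not gains[best] - covered:
            break  # no remaining source adds new detections
        chosen.append(best)
        covered |= gains[best]
    return chosen
```

Only the signatures are consulted during selection, so the raw source data never has to be fetched or stored; the price is that false positives in the filters can over- or under-count conflicts slightly.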
Keywords/Search Tags:Data Inconsistency, Functional Dependency, Matching Dependency, Bloom Filter, Min-hash