
Research On Data Cleaning Method Based On Related Dependencies

Posted on: 2022-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Du
Full Text: PDF
GTID: 2518306746473864
Subject: Computer technology
Abstract/Summary:
Data quality is one of the most important issues in data management. In the era of big data, data is becoming more and more influential. Governments and enterprises analyze data to guide their decisions and set their direction of development. However, real-life data are often dirty because data collection tends to introduce errors, which leads to biased analytical results and decisions for governments and businesses. To prevent the decision mistakes and economic losses caused by dirty data, data cleaning technologies should be updated to meet the needs of big data. Data dependencies are commonly used in data cleaning. This thesis proposes a new form of data dependency, called the related dependency, and a data cleaning method based on related dependencies. The main research contents are as follows:

(1) Data dependencies are an important tool for data cleaning. Functional dependencies and conditional functional dependencies are two widely used kinds, and they can represent various relationships between data. However, many relationships in data cannot be captured by existing data dependencies, which limits the effectiveness of data cleaning. To discover and represent more relationships between data and apply them to data cleaning, this thesis studies the theory of data dependencies in depth and designs a new form of dependency, called the related dependency, which can capture a wide range of relationships between data. A fast algorithm for discovering related dependencies is also proposed.

(2) Compared with functional dependencies and their extensions, related dependencies have greater expressive power, which overcomes the limits of previous dependencies, and enough structure to meet the wider needs of different applications. Based on the definition and characteristics of related dependencies, this thesis analyzes their application scenarios in data cleaning, introduces the concept of related dependency violations, and describes a data cleaning method based on related dependencies together with its workflow.

(3) Finally, extensive experiments on real and synthetic datasets show that the algorithm for discovering related dependencies and the data cleaning method based on them are stable and effective: they discover many potential relationships between data and use those relationships for data cleaning with good performance.
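
As background for the functional dependencies the abstract uses as its baseline, the following is a minimal sketch of how a violation of a classical functional dependency (for example, zip -> city) might be detected. It does not implement related dependencies, whose definition is not given in the abstract. The function name `fd_violations`, the column names, and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    Illustrative helper (hypothetical, not from the thesis): a functional
    dependency lhs -> rhs holds when every such group has a single rhs value.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].append(row)

    violations = {}
    for key, members in groups.items():
        rhs_values = {tuple(m[attr] for attr in rhs) for m in members}
        if len(rhs_values) > 1:  # same lhs, conflicting rhs: dirty data
            violations[key] = members
    return violations


# Made-up sample data: the first two rows conflict on city for one zip.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "60601", "city": "Chicago"},
]
result = fd_violations(rows, ["zip"], ["city"])
```

In a cleaning workflow, each flagged group would be a candidate for repair, for instance by keeping the most frequent right-hand-side value; that repair strategy is an assumption here, not something the abstract specifies.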
Keywords/Search Tags:Data Quality, Data Cleaning, Data Dependencies, Related Dependencies