Key Problem Research Of Data Quality In Big Data

Posted on: 2016-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: L Fan
Full Text: PDF
GTID: 2308330473958500
Subject: Software engineering
Abstract/Summary:
With the rise of big data, the problem of data quality is attracting researchers' attention. Improving data quality requires solving the problem of data inconsistency, which often occurs when duplicate records are distributed across different data nodes. Finding the ideal result among a mass of inconsistent data is an important part of data cleaning. Using clustering algorithms, we can identify gross errors in datasets, group similar objects together, and thereby improve data quality quickly. A variety of algorithms are considered in this dissertation. The notion of a cluster, as found by different algorithms, varies significantly in its properties. The dissertation analyzes the existing clustering algorithms systematically and selects those suited to solving the problem of data inconsistency.

Today, the characteristics of data include volume, variety, velocity and value. Data quality is the foundation of applications built on datasets, such as data mining, so effective algorithms are needed to improve it. The dissertation studies Hadoop and the Map-Reduce programming framework. Drawing on existing Map-Reduce algorithms applied in different fields, it proposes a clustering algorithm based on Map-Reduce to solve the problem of data inconsistency in big data.

In this dissertation, we mainly analyze the K-means and K-medoids clustering algorithms and propose the E-medoids clustering algorithm, which improves the efficiency and applicability of resolving inconsistency in character data. In addition, the EW-medoids clustering algorithm is proposed to enhance clustering accuracy. Finally, we run simulation experiments on the Hadoop platform. The experiments evaluate the concurrency and effectiveness of our algorithm on big data.

The contributions of the dissertation:
1) Proposing a clustering algorithm based on Map-Reduce to solve the problem of data inconsistency in a big data environment.
2) Improving the K-medoids clustering algorithm in the Map-Reduce framework to enhance its applicability and accuracy.
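
The abstract does not define E-medoids or EW-medoids, so the following is only a minimal sketch of plain K-medoids over character data, not the dissertation's algorithm. It assumes Levenshtein edit distance as the dissimilarity measure, and it splits each iteration into a map-like assignment step and a reduce-like medoid update to mirror the Map-Reduce structure in a single Python process. The names `edit_distance`, `assign`, `update` and `k_medoids` are hypothetical.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def assign(records, medoids):
    """Map-like step: key each record by its nearest medoid."""
    clusters = defaultdict(list)
    for r in records:
        nearest = min(medoids, key=lambda m: edit_distance(r, m))
        clusters[nearest].append(r)
    return clusters


def update(members):
    """Reduce-like step: choose the member with the smallest total
    distance to all members of its cluster."""
    return min(members, key=lambda c: sum(edit_distance(c, o) for o in members))


def k_medoids(records, initial_medoids, max_iter=20):
    """Alternate assignment and update until the medoids stop changing
    (or max_iter is reached). Returns {medoid: [members]}."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(records, medoids)
        # A medoid with no members keeps its current value.
        new_medoids = [update(clusters[m]) if clusters[m] else m for m in medoids]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return dict(assign(records, medoids))
```

Calling `k_medoids` on a list of inconsistent spellings of the same entities, with one initial medoid per expected entity, groups the variants around a representative string. Because a medoid is always an actual record, the representative can serve directly as the cleaned value, which is one reason K-medoids suits string data better than K-means, where a "mean" of strings is not defined.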
Keywords/Search Tags:big data, data quality, data inconsistency, Map-Reduce, K-medoids, clustering algorithm