Key Problem Research Of Data Quality In Big Data

Posted on:2016-03-19

Degree:Master

Type:Thesis

Country:China

Candidate:L Fan

GTID:2308330473958500

Subject:Software engineering

Abstract/Summary:

With the rise of big data, the problem of data quality is attracting growing attention from researchers. Improving data quality requires solving the problem of data inconsistency, which often occurs when duplicate records are distributed across different data nodes. Finding the ideal results within a mass of inconsistent data is therefore an important part of data cleaning. Clustering algorithms can separate gross errors from a dataset and group similar objects together, which improves data quality quickly. The dissertation presents a variety of clustering algorithms; the notion of a cluster, as found by different algorithms, varies significantly in its properties. It systematically analyzes the existing clustering algorithms and selects those suited to solving the problem of data inconsistency.

Today, data is characterized by volume, variety, velocity and value. Data quality is the foundation of dataset-based applications such as data mining, so effective algorithms are needed to improve it. The dissertation studies Hadoop and the Map-Reduce programming framework. Drawing on existing Map-Reduce-based algorithms applied in different fields, it proposes a clustering algorithm based on Map-Reduce to solve the problem of data inconsistency in big data.

The dissertation mainly analyzes the K-means and K-medoids clustering algorithms and proposes the E-medoids clustering algorithm, which improves the efficiency and applicability of resolving inconsistency in character data. In addition, the EW-medoids clustering algorithm is proposed to enhance clustering accuracy. Finally, simulated experiments are carried out on the Hadoop platform, and the results evaluate the concurrency and effectiveness of the algorithm on big data.

The contributions of the dissertation are:
1) Proposing a clustering algorithm based on Map-Reduce to solve the problem of data inconsistency in a big data environment.
2) Improving the K-medoids clustering algorithm within the Map-Reduce framework to enhance its applicability and accuracy.
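
To make the idea of K-medoids clustering in a Map-Reduce style concrete, the following is a minimal sketch in Python. It is not the E-medoids or EW-medoids algorithm of the dissertation, whose details the abstract does not give. It assumes plain K-medoids over character strings, with Levenshtein edit distance as the dissimilarity measure (an assumption made here for illustration). The function names `map_assign`, `reduce_update`, `assign` and `k_medoids` are hypothetical, and the map and reduce steps run in a single process rather than on Hadoop.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def map_assign(record, medoids):
    """Map step (illustrative): emit (nearest medoid, record)."""
    nearest = min(medoids, key=lambda m: (edit_distance(record, m), m))
    return nearest, record


def reduce_update(members):
    """Reduce step (illustrative): choose the member with the smallest
    total distance to all members of its cluster as the new medoid."""
    return min(members, key=lambda c: (sum(edit_distance(c, x) for x in members), c))


def assign(records, medoids):
    """Group records by nearest medoid, standing in for the shuffle phase."""
    groups = defaultdict(list)
    for r in records:
        key, value = map_assign(r, medoids)
        groups[key].append(value)
    return groups


def k_medoids(records, initial_medoids, max_iter=10):
    """Alternate map and reduce steps until the medoids stop changing.
    Assumes the initial medoids are distinct and drawn from `records`."""
    medoids = sorted(initial_medoids)
    for _ in range(max_iter):
        groups = assign(records, medoids)
        new_medoids = sorted(reduce_update(g) for g in groups.values())
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, dict(assign(records, medoids))


if __name__ == "__main__":
    # Made-up inconsistent spellings of two values, for illustration only.
    records = ["New York", "New Yrok", "NewYork", "Boston", "Bostn", "Boston "]
    medoids, clusters = k_medoids(records, ["New Yrok", "Bostn"])
    print(medoids)
    print(clusters)
```

In this sketch each medoid acts as the representative value for a group of inconsistent duplicates. On a real cluster the map and reduce functions would be separate tasks, with the current medoids distributed to every mapper at the start of each iteration.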

Keywords/Search Tags:

big data, data quality, data inconsistency, Map-Reduce, K-medoids, clustering algorithm

Related items

1	Research On Incremental Multiple Medoids Clustering Algorithm Based On Weighted For Big Data
2	The K-MM clustering algorithm based on K-means and K-medoids in data mining
3	Research On Algorithms Of Big Data's Consistency Quality Analysis
4	Research On The Key Technologies Of Distributed Big Data Consistency Management
5	Research On K-medoids Clustering Algorithm Under Privacy Protection Model
6	Research On Image Clustering Based On K-medoids Algorithm
7	Research And Application Of K-medoids Clustering Algorithm Based On ?_o-neighborhood Search Strategy
8	Research On K-medoids Clustering Optimization Algorithm Based On Swarm Intelligence
9	Oneof Text Clustering Algorithm Based On Big Data
10	Study On Knowledge Service And K-Medoids Algorithm Improving In Big Data Environment