
Big Data Cleaning Framework Design and Implementation Based on Spark

Posted on: 2017-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: H. W. Jin
GTID: 2308330482481821
Subject: Computer technology
Abstract/Summary:
In big data processing, data cleaning accounts for more than two-thirds of the work. Although analysis is the core of big data processing, cleaning is its foundation; efficient cleaning technology therefore improves not only data quality but also the speed of the entire process.

This paper presents a big data cleaning framework based on Spark and describes its implementation. The central idea is to construct a pipeline of cleaning operations that exploits Spark's distributed computing power, using the Resilient Distributed Dataset (RDD) as the basic data structure. Each job unit packages an RDD together with the data-processing actions applied to it, and the pipeline combines many such units to accomplish a full cleaning task. In addition, the paper proposes a multi-way tree structure for defining the cleaning process, which improves performance. As a result, the framework can reuse cleaning units, configure pipelines flexibly, and achieve high performance and extensibility on top of Spark, meeting the requirements of real-world scenarios.

The results show that, with this framework, users can reduce coupling between processing steps by reusing job units, implement complicated big data cleaning flexibly, and lower cost. Most importantly, the framework raises overall cleaning performance to a new level, advancing big data technology.
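The "pipeline of reusable job units" idea described above can be sketched as follows. In the thesis each unit wraps a Spark RDD transformation; in this self-contained illustration, plain Python lists stand in for RDDs, and all names (`CleaningPipeline`, `add_unit`, the three units) are hypothetical, not the framework's actual API.

```python
class CleaningPipeline:
    """Chains independent cleaning units; each unit is a function
    mapping a dataset to a cleaned dataset (an RDD in the real framework)."""

    def __init__(self):
        self.units = []

    def add_unit(self, unit):
        self.units.append(unit)
        return self  # fluent chaining, so pipelines are easy to configure

    def run(self, data):
        # Apply each unit in order, feeding the output of one into the next.
        for unit in self.units:
            data = unit(data)
        return data


# Three reusable cleaning units: strip whitespace from fields,
# drop fully empty rows, and deduplicate rows (order-preserving).
strip_fields = lambda rows: [[f.strip() for f in r] for r in rows]
drop_empty = lambda rows: [r for r in rows if any(r)]
dedupe = lambda rows: [list(t) for t in dict.fromkeys(map(tuple, rows))]

raw = [[" alice ", "30"], ["", ""], ["alice", "30"], ["bob ", "25"]]
clean = (CleaningPipeline()
         .add_unit(strip_fields)
         .add_unit(drop_empty)
         .add_unit(dedupe)
         .run(raw))
# clean == [["alice", "30"], ["bob", "25"]]
```

Because each unit is independent, the same unit can be reused across pipelines, and a pipeline can be reconfigured without touching the units themselves, which is the low-coupling property the abstract emphasizes.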
Keywords/Search Tags: Big Data, Cleaning, Framework, Spark, Pipeline