
Big Data Cleaning Framework Design and Implementation Based on Spark

Posted on: 2017-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: H. W. Jin
GTID: 2308330482481821
Subject: Computer technology
Abstract/Summary:
In big data processing, data cleaning accounts for more than two-thirds of the work. Although analysis is the core of big data processing, cleaning is its foundation; efficient cleaning technology therefore improves not only data quality but also the speed of the entire process.

This paper presents a big data cleaning framework based on Spark and describes its implementation. The central idea is to construct a pipeline of cleaning operations that exploits Spark's distributed computing power, using the Resilient Distributed Dataset (RDD) as the basic data structure. Each job unit packages an RDD together with the data-processing actions applied to it, and the pipeline combines many such units to accomplish a full cleaning task. In addition, the paper proposes a multi-way tree structure for defining the cleaning process, which improves performance. As a result, the framework can reuse cleaning units, configure pipelines flexibly, and achieve high performance and extensibility on top of Spark, meeting the requirements of real-world scenarios.

The results show that, with this framework, users can reduce coupling between processing steps by reusing job units, implement complicated big data cleaning flexibly, and lower cost. Most importantly, the framework raises overall cleaning performance to a new level, advancing big data technology.
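The "pipeline of reusable job units" idea described above can be sketched as follows. In the thesis each unit wraps a Spark RDD transformation; in this self-contained illustration, plain Python lists stand in for RDDs, and all names (`CleaningPipeline`, `add_unit`, the three units) are hypothetical, not the framework's actual API.

```python
class CleaningPipeline:
    """Chains independent cleaning units; each unit is a function
    mapping a dataset to a cleaned dataset (an RDD in the real framework)."""

    def __init__(self):
        self.units = []

    def add_unit(self, unit):
        self.units.append(unit)
        return self  # fluent chaining, so pipelines are easy to configure

    def run(self, data):
        # Apply each unit in order, feeding the output of one into the next.
        for unit in self.units:
            data = unit(data)
        return data


# Three reusable cleaning units: strip whitespace from fields,
# drop fully empty rows, and deduplicate rows (order-preserving).
strip_fields = lambda rows: [[f.strip() for f in r] for r in rows]
drop_empty = lambda rows: [r for r in rows if any(r)]
dedupe = lambda rows: [list(t) for t in dict.fromkeys(map(tuple, rows))]

raw = [[" alice ", "30"], ["", ""], ["alice", "30"], ["bob ", "25"]]
clean = (CleaningPipeline()
         .add_unit(strip_fields)
         .add_unit(drop_empty)
         .add_unit(dedupe)
         .run(raw))
# clean == [["alice", "30"], ["bob", "25"]]
```

Because each unit is independent, the same unit can be reused across pipelines, and a pipeline can be reconfigured without touching the units themselves, which is the low-coupling property the abstract emphasizes.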
Keywords/Search Tags: Big Data, Cleaning, Framework, Spark, Pipeline