
Design And Implementation Of Data Cleaning System With Definable Rules Based On Spark

Posted on: 2020-01-25  Degree: Master  Type: Thesis
Country: China  Candidate: Z P Li  Full Text: PDF
GTID: 2428330596975950  Subject: Engineering
Abstract/Summary:
With the continuous advancement of information technology, virtually every field now generates enormous volumes of data at an astonishing speed. How to use these data to uncover latent value is one of the major issues in computer science and mathematics today, and it is against this background that the concept of "Big Data" emerged. Using data mining, machine learning, and data visualization to explore trends and predict the future is the essence of big data technology. However, to guarantee the accuracy of data analysis results, controlling data quality is a crucial and non-negligible step; in practical big data applications, more than half of the time is typically spent on data cleaning. Meanwhile, as data volumes grow and data sources and data types diversify, how to design a highly efficient and highly universal data cleaning system is a key question that needs to be explored.

This thesis designs and implements a data cleaning system based on Spark, a distributed computing engine dedicated to massive data processing. The system encapsulates the specific business logic of the cleaning process in business components and uses them as the connecting units of the entire cleaning workflow. To solve the problem of data transfer between cleaning components, the thesis analyzes and improves Spark's native application-submission method, then implements a business processing mechanism based on a global SparkContext and Spark's built-in interpreter objects, which supports interactive data cleaning and global data sharing. To remain highly compatible with this mechanism, an interpreter structure is designed and implemented, and on top of it a business component specification is defined whose core is code packaged as string arrays. Following this specification, a series of business components is implemented and an extension interface for new components is provided.

The system supports two ways of executing business components: editing a component's parameters and submitting it for execution in a single step, or dragging and dropping components after editing their parameters, combining them into a business rule diagram in the form of a directed acyclic graph, and letting the system process the diagram as a whole. To ensure the robustness and correctness of rule-diagram processing, the thesis uses depth-first traversal to check whether a user-defined rule diagram contains a loop that would send the business process into an infinite cycle. In addition, a "reverse breadth-first traversal" method is proposed to determine the execution order of the business components, and an optimization based on Spark's native caching technology is proposed to improve the actual execution efficiency of business processes. Experiments show that the system addresses the performance, extensibility, scalability, and usability problems of data cleaning to a considerable extent and has high practical value.
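To make the component mechanism described above concrete, the following Scala sketch models a cleaning business component as a named transformation over a Spark DataFrame executed against a single shared SparkSession. All names here (CleaningContext, BusinessComponent, DropNullRows) are illustrative assumptions, not the thesis's actual API.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical sketch: one SparkSession shared by all components, so that
    // intermediate results can be published and looked up by name, giving the
    // global data sharing the abstract describes.
    object CleaningContext {
      lazy val spark: SparkSession =
        SparkSession.builder().appName("rule-based-cleaning").getOrCreate()

      def publish(name: String, df: DataFrame): Unit = df.createOrReplaceTempView(name)
      def lookup(name: String): DataFrame = spark.table(name)
    }

    // A business component: a named transformation over a DataFrame.
    trait BusinessComponent {
      def name: String
      def run(input: DataFrame): DataFrame
    }

    // Example component: drop rows that are null in any required column.
    class DropNullRows(required: Seq[String]) extends BusinessComponent {
      val name = "drop-null-rows"
      def run(input: DataFrame): DataFrame = input.na.drop(required)
    }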
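The loop check on the rule diagram can be read as a standard depth-first traversal with an "on the current path" marker; the following is a generic Scala sketch of that idea, with components abstracted as integer ids, not the thesis's exact code.

    // A directed graph, given as an adjacency list, contains a cycle iff a
    // depth-first traversal reaches a node already on the current path.
    object CycleCheck {
      def hasCycle(graph: Map[Int, Seq[Int]]): Boolean = {
        val onPath = scala.collection.mutable.Set.empty[Int] // current DFS path
        val done   = scala.collection.mutable.Set.empty[Int] // fully explored

        def dfs(node: Int): Boolean = {
          if (onPath(node)) return true   // back edge: the diagram has a loop
          if (done(node)) return false
          onPath += node
          val cyclic = graph.getOrElse(node, Nil).exists(dfs)
          onPath -= node
          done += node
          cyclic
        }

        graph.keys.exists(dfs)
      }
    }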
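The abstract does not spell out the "reverse breadth-first traversal". One plausible reading, sketched below under that assumption, is to peel the diagram level by level starting from the sink components (those with no outgoing edges) and then execute the levels in reverse, so that every component runs only after all of its predecessors.

    // Hypothetical sketch: peel off layers of components whose successors have
    // all been scheduled, starting from the sinks; prepending each peeled layer
    // reverses the order, so the head of the result holds the source components.
    // Assumes the diagram has already passed the acyclicity check above.
    object ExecutionOrder {
      // graph maps a component id to the downstream components it feeds
      def schedule(graph: Map[Int, Seq[Int]]): List[Seq[Int]] = {
        val nodes  = (graph.keySet ++ graph.values.flatten).toSeq
        val outDeg = scala.collection.mutable.Map(
          nodes.map(n => n -> graph.getOrElse(n, Nil).size): _*)
        val preds  = nodes.map(n =>
          n -> nodes.filter(p => graph.getOrElse(p, Nil).contains(n))).toMap

        var layers   = List.empty[Seq[Int]]
        var frontier = nodes.filter(outDeg(_) == 0)          // the sink components
        while (frontier.nonEmpty) {
          layers = frontier :: layers
          val next = scala.collection.mutable.ListBuffer.empty[Int]
          for (n <- frontier; p <- preds(n)) {
            outDeg(p) -= 1
            if (outDeg(p) == 0) next += p                    // all successors scheduled
          }
          frontier = next.toList
        }
        layers // execute layer by layer: head = sources, last = sinks
      }
    }

Components within one layer have no dependencies on each other, so a scheduler built this way could even run each layer's components concurrently.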
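Similarly, the cache-based optimization presumably amounts to persisting any intermediate DataFrame that feeds more than one downstream component, so Spark does not recompute its lineage once per consumer. A minimal sketch using Spark's standard cache()/unpersist() calls; the helper and its consumer-count parameter are illustrative:

    import org.apache.spark.sql.DataFrame

    // Minimal sketch: cache an intermediate result only when the rule diagram
    // shows it has multiple consumers, and release it once they have all run.
    def withSharedResult(output: DataFrame, consumers: Int)(use: DataFrame => Unit): Unit = {
      val shared = if (consumers > 1) output.cache() else output
      try use(shared)
      finally if (consumers > 1) shared.unpersist()
    }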
Keywords/Search Tags: Spark, Big Data, Data Cleaning