
Design And Implementation Of Data Cleaning System With Definable Rules Based On Spark

Posted on: 2020-01-25  Degree: Master  Type: Thesis
Country: China  Candidate: Z P Li  Full Text: PDF
GTID: 2428330596975950  Subject: Engineering
Abstract/Summary:
With the continuous advancement of information technology, virtually every field now generates enormous volumes of data at an astonishing speed. How to use these data to uncover latent value is one of the major issues in computer science and mathematics today, and it is against this background that the concept of "Big Data" emerged. Using data mining, machine learning, and data visualization to explore trends and predict the future is the essence of big data technology. However, to guarantee the accuracy of data analysis results, controlling data quality is a crucial and non-negligible step; in practical big data applications, more than half of the time is typically spent on data cleaning. Meanwhile, as data volumes grow and data sources and data types diversify, how to design a highly efficient and highly universal data cleaning system is a key question that needs to be explored.

This thesis designs and implements a data cleaning system based on Spark, a distributed computing engine dedicated to massive data processing. The system encapsulates the specific business logic of the cleaning process in business components and uses them as the connecting units of the entire cleaning workflow. To solve the problem of data transfer between cleaning components, the thesis analyzes and improves Spark's native application-submission method, then implements a business processing mechanism based on a global SparkContext and Spark's built-in interpreter objects, which supports interactive data cleaning and global data sharing. To remain highly compatible with this mechanism, an interpreter structure is designed and implemented, and on top of it a business component specification is defined whose core is code packaged as string arrays. Following this specification, a series of business components is implemented and an extension interface for new components is provided.

The system supports two ways of executing business components: editing a component's parameters and submitting it for execution in a single step, or dragging and dropping components after editing their parameters, combining them into a business rule diagram in the form of a directed acyclic graph, and letting the system process the diagram as a whole. To ensure the robustness and correctness of rule-diagram processing, the thesis uses depth-first traversal to check whether a user-defined rule diagram contains a loop that would send the business process into an infinite cycle. In addition, a "reverse breadth-first traversal" method is proposed to determine the execution order of the business components, and an optimization based on Spark's native caching technology is proposed to improve the actual execution efficiency of business processes. Experiments show that the system addresses the performance, extensibility, scalability, and usability problems of data cleaning to a considerable extent and has high practical value.
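To make the component mechanism described above concrete, the following Scala sketch models a cleaning business component as a named transformation over a Spark DataFrame executed against a single shared SparkSession. All names here (CleaningContext, BusinessComponent, DropNullRows) are illustrative assumptions, not the thesis's actual API.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical sketch: one SparkSession shared by all components, so that
    // intermediate results can be published and looked up by name, giving the
    // global data sharing the abstract describes.
    object CleaningContext {
      lazy val spark: SparkSession =
        SparkSession.builder().appName("rule-based-cleaning").getOrCreate()

      def publish(name: String, df: DataFrame): Unit = df.createOrReplaceTempView(name)
      def lookup(name: String): DataFrame = spark.table(name)
    }

    // A business component: a named transformation over a DataFrame.
    trait BusinessComponent {
      def name: String
      def run(input: DataFrame): DataFrame
    }

    // Example component: drop rows that are null in any required column.
    class DropNullRows(required: Seq[String]) extends BusinessComponent {
      val name = "drop-null-rows"
      def run(input: DataFrame): DataFrame = input.na.drop(required)
    }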
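The loop check on the rule diagram can be read as a standard depth-first traversal with an "on the current path" marker; the following is a generic Scala sketch of that idea, with components abstracted as integer ids, not the thesis's exact code.

    // A directed graph, given as an adjacency list, contains a cycle iff a
    // depth-first traversal reaches a node already on the current path.
    object CycleCheck {
      def hasCycle(graph: Map[Int, Seq[Int]]): Boolean = {
        val onPath = scala.collection.mutable.Set.empty[Int] // current DFS path
        val done   = scala.collection.mutable.Set.empty[Int] // fully explored

        def dfs(node: Int): Boolean = {
          if (onPath(node)) return true   // back edge: the diagram has a loop
          if (done(node)) return false
          onPath += node
          val cyclic = graph.getOrElse(node, Nil).exists(dfs)
          onPath -= node
          done += node
          cyclic
        }

        graph.keys.exists(dfs)
      }
    }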
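The abstract does not spell out the "reverse breadth-first traversal". One plausible reading, sketched below under that assumption, is to peel the diagram level by level starting from the sink components (those with no outgoing edges) and then execute the levels in reverse, so that every component runs only after all of its predecessors.

    // Hypothetical sketch: peel off layers of components whose successors have
    // all been scheduled, starting from the sinks; prepending each peeled layer
    // reverses the order, so the head of the result holds the source components.
    // Assumes the diagram has already passed the acyclicity check above.
    object ExecutionOrder {
      // graph maps a component id to the downstream components it feeds
      def schedule(graph: Map[Int, Seq[Int]]): List[Seq[Int]] = {
        val nodes  = (graph.keySet ++ graph.values.flatten).toSeq
        val outDeg = scala.collection.mutable.Map(
          nodes.map(n => n -> graph.getOrElse(n, Nil).size): _*)
        val preds  = nodes.map(n =>
          n -> nodes.filter(p => graph.getOrElse(p, Nil).contains(n))).toMap

        var layers   = List.empty[Seq[Int]]
        var frontier = nodes.filter(outDeg(_) == 0)          // the sink components
        while (frontier.nonEmpty) {
          layers = frontier :: layers
          val next = scala.collection.mutable.ListBuffer.empty[Int]
          for (n <- frontier; p <- preds(n)) {
            outDeg(p) -= 1
            if (outDeg(p) == 0) next += p                    // all successors scheduled
          }
          frontier = next.toList
        }
        layers // execute layer by layer: head = sources, last = sinks
      }
    }

Components within one layer have no dependencies on each other, so a scheduler built this way could even run each layer's components concurrently.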
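Similarly, the cache-based optimization presumably amounts to persisting any intermediate DataFrame that feeds more than one downstream component, so Spark does not recompute its lineage once per consumer. A minimal sketch using Spark's standard cache()/unpersist() calls; the helper and its consumer-count parameter are illustrative:

    import org.apache.spark.sql.DataFrame

    // Minimal sketch: cache an intermediate result only when the rule diagram
    // shows it has multiple consumers, and release it once they have all run.
    def withSharedResult(output: DataFrame, consumers: Int)(use: DataFrame => Unit): Unit = {
      val shared = if (consumers > 1) output.cache() else output
      try use(shared)
      finally if (consumers > 1) shared.unpersist()
    }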
Keywords/Search Tags: Spark, Big Data, Data Cleaning