Research And Implementation Of Data Imputation Technology Based On Spark

Posted on:2018-08-18

Degree:Master

Type:Thesis

Country:China

Candidate:J R Yan

Full Text:PDF

GTID:2348330518496703

Subject:Computer technology

Abstract/Summary:

With the spread of the mobile Internet, the volume of data is growing explosively. Data is "rich", yet people often complain that information is "poor", a gap that can be attributed to poorly controlled data quality. Data cleaning, the first step of ETL, is therefore attracting more and more attention from researchers. Missing fields are one of the common problems in data cleaning, and the one with the biggest impact on data mining algorithms. Traditional imputation algorithms for missing fields have low accuracy, so a new approach is needed. At the same time, distributed technologies have emerged just as the storage space and computing speed of a single machine can no longer handle massive data. Using distributed computing effectively to clean data and to optimize the computation has become a hot research topic.

This thesis presents an imputation algorithm based on association rules, which is implemented and debugged in a distributed system environment. The main work of the thesis includes the following aspects:

(1) Propose an imputation algorithm for missing fields based on association rules, and modify item set growth, rule selection, and other steps of the algorithm to avoid redundant computation.

(2) Configure the cluster, which includes the distributed storage system HDFS, the distributed computing framework Spark, and the data warehouse analysis tool Hive, and install MySQL and other software used to store metadata.

(3) Implement the algorithm on the distributed computing system Spark, and optimize the persistence of some intermediate result sets and the load balancing. This not only makes the program more logically organized, but also improves the resource utilization of the system.
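To make the idea of association-rule imputation concrete, here is a minimal single-machine sketch in plain Python. It is not the thesis's algorithm and does not use Spark: the function names (`mine_rules`, `impute`), the thresholds, and the sample records are all hypothetical, and the optimizations the abstract mentions (modified item set growth, rule selection, persistence, load balancing) are not reproduced. Rules are mined only from fully observed records, and a missing field is filled from the matching rule with the highest confidence.

```python
from collections import Counter
from itertools import combinations


def mine_rules(records, min_support=0.3, min_confidence=0.6, max_antecedent=2):
    """Mine rules antecedent -> (attribute, value) from complete records.

    Illustrative brute-force enumeration; thresholds are assumptions.
    """
    complete = [r for r in records if all(v is not None for v in r.values())]
    n = len(complete)
    if n == 0:
        return []
    # Count every item set of size 1 .. max_antecedent + 1.
    counts = Counter()
    for r in complete:
        items = list(r.items())  # each item is an (attribute, value) pair
        for size in range(1, max_antecedent + 2):
            for itemset in combinations(items, size):
                counts[frozenset(itemset)] += 1
    rules = []
    for itemset, count in counts.items():
        support = count / n
        if len(itemset) < 2 or support < min_support:
            continue
        # Each item in a frequent set can serve as the consequent.
        for consequent in itemset:
            antecedent = itemset - {consequent}
            confidence = count / counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules


def impute(record, rules):
    """Fill each missing field using the best matching rule, if any."""
    filled = dict(record)
    known = {(a, v) for a, v in record.items() if v is not None}
    for attr, value in record.items():
        if value is not None:
            continue
        candidates = [
            (conf, sup, cons[1])
            for ante, cons, sup, conf in rules
            if cons[0] == attr and ante <= known
        ]
        if candidates:
            # Prefer highest confidence, then highest support.
            best = max(candidates, key=lambda c: (c[0], c[1]))
            filled[attr] = best[2]
    return filled


# Hypothetical sample data; None marks a missing field.
records = [
    {"city": "Beijing", "carrier": "CMCC", "plan": "4G"},
    {"city": "Beijing", "carrier": "CMCC", "plan": "4G"},
    {"city": "Beijing", "carrier": "CMCC", "plan": "4G"},
    {"city": "Shanghai", "carrier": "Unicom", "plan": "3G"},
    {"city": "Shanghai", "carrier": "Unicom", "plan": "3G"},
    {"city": "Beijing", "carrier": "CMCC", "plan": None},
    {"city": "Shanghai", "carrier": None, "plan": "3G"},
]
rules = mine_rules(records)
imputed = [impute(r, rules) for r in records]
```

A field for which no rule matches is left as None rather than guessed. In a distributed setting, the counting step is the natural candidate for parallelization, but how the thesis partitions that work is not stated in the abstract.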

Keywords/Search Tags:

Data cleaning, Association rules, Missing value, Spark, Parallelization

Related items

1	Research And Application Of Parallelization Of Association Rule Mining Algorithm
2	Design And Implementation Of Data Cleaning System With Definable Rules Based On Spark
3	Research And Application Of Parallel FP-Growth Algorithm Based On Spark
4	Research On Data Mining Based Decision Rules And Association Rules
5	Association Rule Algorithm Optimization And Parallelization Research Based On Spark
6	An Improved Algorithm Of Association Rules Based On The Spark
7	Research On Association Rules Parallel Optimization Algorithm And Application
8	Research And Application Of Association Rule Algorithm Based On Spark Platform
9	Research On Association Rules Algorithm For Massive Telecommunication Network Alarm Data Based On Spark
10	Distributed Association Rules Algorithm Based On The Spark