Research And Implementation Of Data Imputation Technology Based On Spark

Posted on: 2018-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: J R Yan
GTID: 2348330518496703
Subject: Computer technology
Abstract/Summary:
With the spread of the mobile Internet, the volume of data is growing explosively. Data is "rich", yet people often complain that information is "poor", which can be attributed to poorly controlled data quality. Data cleaning, the first step of ETL, is therefore drawing increasing attention from researchers. Missing fields are one of the most common problems in data cleaning, and the one that most affects data mining algorithms. Traditional imputation algorithms for missing fields have low accuracy, so a new approach is needed. At the same time, distributed technologies have emerged because the storage capacity and computing speed of a single machine cannot handle massive data. Using distributed computing effectively to clean data and to optimize the computation has become a hot research topic.

This thesis presents an imputation algorithm based on association rules, implemented and debugged in a distributed environment. The main work covers the following aspects:
(1) Propose an imputation algorithm for missing fields based on association rules, and modify item set growth, rule selection and other steps of the algorithm to avoid redundant computation.
(2) Configure the cluster, including the distributed storage system HDFS, the distributed computing framework Spark and the data warehouse analysis tool Hive, and install MySQL and other software used to store metadata.
(3) Implement the algorithm on the distributed computing system Spark, and optimize the persistence of some intermediate result sets and the load balancing. This not only makes the program's logic clearer but also improves the system's resource utilization.
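The abstract does not give the algorithm's details, so the following is only a minimal single-machine sketch of the general idea of association-rule imputation, not the thesis's method. It assumes records are dictionaries of categorical fields, enumerates small antecedents by brute force (the thesis instead describes an optimized item set growth step), and does not use Spark. The names `mine_rules` and `impute`, the thresholds and the sample fields are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_rules(records, target, min_support, min_confidence, max_len=2):
    """Mine rules {(field, value), ...} -> target value from complete records.

    Brute-force enumeration of antecedents up to max_len items; an
    illustrative assumption, not the optimized growth step of the thesis.
    """
    complete = [r for r in records if r.get(target) is not None]
    n = len(complete)
    if n == 0:
        return []
    antecedent_counts = Counter()
    joint_counts = Counter()
    for r in complete:
        # Field names are unique within a record, so sorting is well defined.
        items = sorted((k, v) for k, v in r.items()
                       if k != target and v is not None)
        for size in range(1, max_len + 1):
            for combo in combinations(items, size):
                antecedent_counts[combo] += 1
                joint_counts[(combo, r[target])] += 1
    rules = []
    for (combo, value), joint in joint_counts.items():
        support = joint / n                        # P(antecedent and value)
        confidence = joint / antecedent_counts[combo]  # P(value | antecedent)
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, value, support, confidence))
    # Prefer higher confidence, then higher support, then longer antecedents.
    rules.sort(key=lambda r: (-r[3], -r[2], -len(r[0])))
    return rules


def impute(record, target, rules):
    """Fill a missing target field with the first matching rule's value."""
    if record.get(target) is not None:
        return record
    items = {(k, v) for k, v in record.items()
             if k != target and v is not None}
    for combo, value, _, _ in rules:
        if set(combo) <= items:
            return {**record, target: value}
    return record  # no rule applies; leave the field missing
```

In a distributed setting, the counting loop is the part that would naturally be parallelized across partitions, with the rule set shared among the workers that perform the filling; how the thesis actually partitions this work is not stated in the abstract.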
Keywords: Data cleaning, Association rules, Missing value, Spark, Parallelization