
The Design And Implementation Of Data Anomaly Detection And Repair Method Based On Spark

Posted on: 2020-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: K H Wang
Full Text: PDF
GTID: 2428330572973572
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of data technology, data mining plays an increasingly important role in promoting the development of various industries. Data mining requires high-quality data; however, most raw data contains many abnormalities such as duplicate values, missing values, and outliers. Such low-quality data can lead to completely different results or even disasters. Therefore, the problem of how to detect and repair anomalous, low-quality data cannot be ignored. Currently, relevant research at home and abroad does not consider the relationships between data features when detecting outliers, which reduces detection accuracy, and general-purpose methods for filling missing values perform poorly. To address these problems, this thesis designs and implements a system built around a two-stage algorithm for data anomaly detection and repair. The first stage is a grouping-based data anomaly detection algorithm, which improves the traditional LOF (Local Outlier Factor) algorithm by considering the correlation between attribute variables. A CF-Tree is used to group the attribute variables. Finally, the algorithm is parallelized and rewritten to improve detection performance. The second stage is an anomaly data repair algorithm built on the random forest algorithm, which repairs the anomalous data received from the first stage. To accommodate different types of datasets, an improved grid search method is introduced to continuously adjust the prediction model, and high-quality data is generated through dynamic parameter adjustment and parallelization. The experimental results indicate that the method designed in this thesis is superior to the comparison algorithms in terms of the resulting data.
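
For readers unfamiliar with the baseline that the first stage builds on, the following is a minimal sketch of the traditional LOF (Local Outlier Factor) score in plain Python. It is not the thesis's improved, grouping-based or parallelized algorithm. The function names `knn` and `lof_scores`, the sample points, and the choice `k=2` are illustrative assumptions, and the sketch simplifies standard LOF by taking exactly k neighbours rather than including ties at the k-distance.

```python
import math

def knn(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    return dists[:k]

def lof_scores(points, k=2):
    """Compute a traditional LOF score for every point (hypothetical helper).

    Scores near 1 indicate a density similar to the neighbours;
    scores well above 1 indicate a likely outlier.
    """
    n = len(points)
    neighbours = [knn(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        mean_reach = sum(reach) / len(reach)
        # Guard against division by zero when points coincide (an assumption
        # of this sketch, not part of the original LOF definition).
        lrd.append(1.0 / max(mean_reach, 1e-12))

    # LOF: mean ratio of the neighbours' densities to the point's own density.
    scores = []
    for i in range(n):
        ratios = [lrd[j] / lrd[i] for _, j in neighbours[i]]
        scores.append(sum(ratios) / len(ratios))
    return scores

# Illustrative data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 3))
```

In this toy dataset the isolated point (5, 5) receives the largest score, while the four clustered points score 1. The score treats all attributes jointly through a single distance, which is the kind of limitation the abstract says the grouping of attribute variables is meant to address.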
Keywords/Search Tags:data detection, data repair, Spark, grouping, dynamic parameter