
Research And System Construction Of Data Preprocessing Mechanism

Posted on: 2019-07-11 | Degree: Master | Type: Thesis
Country: China | Candidate: W Z Chong | Full Text: PDF
GTID: 2428330566999384 | Subject: Computer technology
Abstract/Summary:
As information technology is applied in all walks of life, the amount of data grows rapidly and big data has emerged. For various reasons, data quality problems inevitably arise, and data mining and data analysis performed on such data can yield wrong conclusions, which may bring unexpected consequences for scientific research and for enterprises. To improve the quality of a data mining data set, the first step is to preprocess it, and missing data imputation and outlier detection are particularly important parts of that preprocessing. However, current missing value imputation algorithms suffer from low imputation accuracy and poor stability, their accuracy drops rapidly as the missing-data rate increases, and, as single-machine implementations, they cannot handle large-scale data. Existing outlier detection algorithms have their own problems: their detection accuracy depends on parameter choices, they mainly target small-scale data, and their capacity for large-scale data is limited. In view of this situation, this thesis carries out extensive research and development in this area. The main work includes the following:

(1) Existing missing data imputation algorithms are surveyed and their advantages and disadvantages analyzed. A missing data imputation technique based on an evidence chain is proposed. The concept of an evidence chain is introduced, and combinations of the attribute values related to a missing entry are used as estimates of the missing value. Because there may be multiple combinations of related attribute values, the evidence chain may contain many pieces of evidence. This improves imputation accuracy, makes the imputation more stable, and prevents the sharp drop in imputation quality as the missing-data rate rises. The algorithm is distributed and parallelized so that it can impute missing data in massive data sets. (A simplified, illustrative sketch of this idea follows the abstract.)

(2) Existing outlier detection algorithms are surveyed, and an outlier detection technique based on data features is proposed. A new definition of outliers is given, and attribute-importance knowledge from rough set theory is used to reduce the dimensionality of high-dimensional data so that the algorithm can handle high-dimensional data sets effectively. The data feature of a tuple reflects how strongly it differs from the data set as a whole, and the algorithm flags tuples whose data features differ significantly from the overall data set. The algorithm does not depend on prior knowledge of the data set and requires no parameter selection, yet it still achieves high detection accuracy, and it is suitable for processing massive data sets on distributed platforms. (A simplified sketch also follows the abstract.)

(3) Based on the missing value imputation and outlier detection algorithms above, this thesis designs and builds a big data preprocessing system. The system is built on the Hadoop distributed computing platform and provides distributed storage of data sets as well as distributed missing data imputation and outlier detection for large-scale data sets. The upper layer of the system uses Web technology to support user interaction and to visualize the processing results. (A sketch of one way such a step can be distributed follows the abstract.)
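The following is a minimal, illustrative sketch of how the evidence-chain idea in (1) might look in code; it is written for this abstract and is not taken from the thesis. For a record with a missing value, every combination of its observed attributes that matches complete records contributes one piece of evidence, and the pieces are combined by a match-count weighted average. The function name, the use of pandas, and the assumption of a numeric target attribute are all choices made here for illustration.

```python
# Hypothetical, simplified reading of the evidence-chain idea: for a record with a
# missing attribute, every complete record that agrees with it on a subset of the
# observed attributes contributes one "piece of evidence"; the pieces are combined
# (here by a match-count weighted average) into the imputed value.
from itertools import combinations

import pandas as pd


def evidence_chain_impute(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Impute missing values of `target` from records sharing observed attribute values."""
    df = df.copy()
    observed_cols = [c for c in df.columns if c != target]
    complete = df.dropna(subset=[target])

    for idx, row in df[df[target].isna()].iterrows():
        evidence_values, evidence_weights = [], []
        # Enumerate combinations of the record's observed attributes (the "chain"),
        # from the largest combination down to single attributes.
        for r in range(len(observed_cols), 0, -1):
            for cols in combinations(observed_cols, r):
                if row[list(cols)].isna().any():
                    continue
                match = complete
                for c in cols:
                    match = match[match[c] == row[c]]
                if not match.empty:
                    # Each matching combination is one piece of evidence, weighted
                    # here by how many attributes it agrees on.
                    evidence_values.append(match[target].mean())
                    evidence_weights.append(r)
        if evidence_values:
            df.at[idx, target] = (
                sum(v * w for v, w in zip(evidence_values, evidence_weights))
                / sum(evidence_weights)
            )
    return df
```

The exhaustive enumeration of attribute combinations is exponential and only workable for small attribute counts; the thesis' distributed, parallelized version is not reproduced here.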
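For contribution (2), the sketch below illustrates one simple reading of a deviation-based "data feature" score. The aggregation rule and the cut-off are illustrative choices, not the thesis' parameter-free criterion, and the rough-set attribute-importance weighting described above is not reproduced.

```python
# A minimal sketch of the "data feature" idea as read here: each tuple's feature is
# its per-attribute deviation from the overall data set, aggregated into one score;
# tuples whose score is far from the bulk of scores are flagged as outliers.
import numpy as np


def data_feature_outliers(data: np.ndarray) -> np.ndarray:
    """Return a boolean mask of rows whose aggregate deviation is extreme."""
    # Per-attribute deviation of every tuple from the data set as a whole.
    center = np.median(data, axis=0)
    spread = np.median(np.abs(data - center), axis=0) + 1e-12
    deviation = np.abs(data - center) / spread          # shape (n_rows, n_cols)

    # Aggregate the per-attribute deviations into one "data feature" score per tuple.
    score = deviation.mean(axis=1)

    # Flag scores that are themselves far from the typical score (illustrative rule).
    cutoff = np.median(score) + 3.0 * np.median(np.abs(score - np.median(score)))
    return score > cutoff
```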
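For the Hadoop-based system in (3), the following Hadoop Streaming style mapper is given only as an illustration of how per-attribute statistics consumed by the steps above can be computed in a distributed fashion. It is not the thesis' implementation, and the CSV input format and the handling of empty cells are assumptions made for this sketch.

```python
# Not the thesis' implementation: a Hadoop Streaming style mapper illustrating one
# common way to distribute preprocessing statistics. Each mapper reads CSV rows from
# stdin and emits (attribute_index, value) pairs; a reducer can then compute
# per-attribute summaries (means, medians) used by imputation and outlier detection.
import sys


def main() -> None:
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        for i, value in enumerate(fields):
            if value != "":                       # skip missing cells
                sys.stdout.write(f"{i}\t{value}\n")


if __name__ == "__main__":
    main()
```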
Keywords/Search Tags: big data, data preprocessing, missing data imputation, outlier detection, distributed parallelization