
An Analysis On Data Cleaning Algorithm And Its Application On Web Logs Processing

Posted on: 2018-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Xu
Full Text: PDF
GTID: 2428330569485436
Subject: Computer technology
Abstract/Summary:
Real-world data tends to be dirty, and its quality problems reduce the accuracy and usability of data analysis. Many data cleaning algorithms have been proposed, but most target a specific field, and most existing reviews of data cleaning algorithms likewise cover only a particular area; a comprehensive review is lacking. It is therefore of practical importance to collect the data cleaning algorithms proposed in recent years and to analyze and classify them so that their advantages and disadvantages are clear, which can help researchers who want to learn about the field or improve the algorithms further. Based on the cleaning requirements, data cleaning algorithms can be divided into three stages: error detection, data repairing and feature engineering. In the error detection stage, four classes of duplicate record detection algorithms are analyzed, and algorithms that deal with attribute errors, constraint conflicts and other errors are classified according to whether they are quantitative or qualitative methods. In the data repairing stage, missing value imputation, one of the most popular topics, is divided into four classes of algorithms for comparison. For repairing consistency errors, three further classification criteria are discussed in addition to the criterion based on the type of integrity constraint rule used. Feature selection and feature construction are the two ways of processing features in the feature engineering stage; the main difference between them is whether the original features are transformed. Most research focuses on feature selection rather than feature construction. Mathematical statistics and machine learning algorithms are popular in feature selection, and in recent years researchers have shown increasing interest in evolutionary computation algorithms. The main objective of feature selection is to find the globally optimal solution while reducing the number of features, a task to which evolutionary computation algorithms are well suited. Feature construction projects the data into a space of another dimension; several feature construction algorithms are discussed and analyzed briefly. Finally, a Spark cluster was chosen as the experimental platform, and the data cleaning algorithms best suited to the characteristics of the logs were selected to solve the missing value imputation and feature selection problems on real web logs.
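To make the two tasks named at the end of the abstract concrete, the following is a minimal sketch in plain Python, not the thesis's Spark implementation and not the algorithms it selected. The record fields, the `LOGS` sample, and the helper names `impute` and `drop_constant_features` are all hypothetical. It stands in the simplest possible techniques: mean/mode imputation for missing values, and a filter-style feature selection that drops fields carrying no information.

```python
from collections import Counter
from statistics import mean

# Hypothetical parsed web-log records; None marks a missing value.
LOGS = [
    {"status": 200, "bytes": 512, "method": "GET", "proto": "HTTP/1.1"},
    {"status": 200, "bytes": None, "method": "GET", "proto": "HTTP/1.1"},
    {"status": 404, "bytes": 128, "method": None, "proto": "HTTP/1.1"},
    {"status": 200, "bytes": 1024, "method": "POST", "proto": "HTTP/1.1"},
]


def impute(records):
    """Fill numeric gaps with the column mean and other gaps with the mode.

    Assumes every field has at least one observed (non-None) value.
    """
    fields = list(records[0].keys())
    fill = {}
    for f in fields:
        seen = [r[f] for r in records if r[f] is not None]
        if all(isinstance(v, (int, float)) for v in seen):
            fill[f] = mean(seen)  # numeric field: column mean
        else:
            fill[f] = Counter(seen).most_common(1)[0][0]  # categorical: mode
    return [
        {f: (r[f] if r[f] is not None else fill[f]) for f in fields}
        for r in records
    ]


def drop_constant_features(records):
    """Filter-style selection: keep only fields that take more than one value."""
    keep = [f for f in records[0] if len({r[f] for r in records}) > 1]
    return [{f: r[f] for f in keep} for r in records]


cleaned = drop_constant_features(impute(LOGS))
```

In this sample, the missing `bytes` value is replaced by the mean of the observed sizes, the missing `method` by the most frequent method, and the constant `proto` field is dropped. On a Spark cluster the same two steps would be expressed over distributed data rather than Python lists.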
Keywords/Search Tags: Data cleaning, Error detection, Data repairing, Feature selection, Web log