An Analysis On Data Cleaning Algorithm And Its Application On Web Logs Processing

Posted on:2018-05-01

Degree:Master

Type:Thesis

Country:China

Candidate:H Xu

Full Text:PDF

GTID:2428330569485436

Subject:Computer technology

Abstract/Summary:

Data in the real world tends to be dirty, and its varying quality affects the accuracy and availability of data analysis. Plenty of algorithms have been proposed in the field of data cleaning, but most of them target a specific domain. Likewise, most existing reviews of data cleaning algorithms focus on one particular area, and no comprehensive review exists. It is therefore of practical significance to sort out the data cleaning algorithms proposed in recent years, and then to analyze and classify them so as to show their advantages and disadvantages. Such a survey can help researchers interested in this field to learn the algorithms or improve them further.

Based on demand, data cleaning algorithms can be divided into three stages: the error detection stage, the data repairing stage, and the feature engineering stage. In the error detection stage, four classes of duplicate record detection algorithms are analyzed. Algorithms that deal with attribute errors, constraint conflicts, and other errors are then classified according to whether they are quantitative or qualitative methods. In the data repairing stage, missing value imputation, one of the most popular topics, is classified into four kinds of algorithms for comparison. For repairing consistency errors, three further classification criteria are discussed in addition to the criterion based on the type of integrity constraint rule used.

Feature selection and feature construction are the two ways of processing features in the feature engineering stage. The main difference between them is whether the original features are transformed. Most research focuses on feature selection rather than feature construction. Mathematical statistics and machine learning algorithms are popular in the field of feature selection, and in recent years researchers have shown increasing interest in evolutionary computation algorithms. The main objective of feature selection is to find the global optimal solution while reducing the number of features, a task at which evolutionary computation algorithms are skilled. Feature construction is a method that projects the data into a space of another dimension; several feature construction algorithms are discussed and analyzed briefly. A Spark cluster was chosen as the experimental platform, and the data cleaning algorithms best suited to the characteristics of the logs were selected to solve the missing value imputation and feature selection problems on real web logs.
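
As a rough illustration of the two tasks applied to the web logs, the following Python sketch fills missing values with a column mean and then drops constant features. The abstract does not name the algorithms the thesis actually selected, and the experiments ran on a Spark cluster rather than in plain Python, so mean imputation, the variance filter, the sample records, and all names here (`LOGS`, `impute_mean`, `select_by_variance`) are assumptions chosen only to keep the example small.

```python
from statistics import mean, pvariance

# Hypothetical parsed web log records; None marks a missing value.
LOGS = [
    {"status": 200, "bytes": 512, "latency_ms": 20, "proto": 1},
    {"status": 200, "bytes": None, "latency_ms": 30, "proto": 1},
    {"status": 404, "bytes": 128, "latency_ms": None, "proto": 1},
    {"status": 200, "bytes": 1024, "latency_ms": 40, "proto": 1},
]
COLUMNS = ["status", "bytes", "latency_ms", "proto"]


def impute_mean(rows, columns):
    """Return copies of rows with each missing value replaced by the
    mean of the observed values in the same column.

    Assumes every column has at least one observed value.
    """
    filled = [dict(r) for r in rows]  # leave the input untouched
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = mean(observed)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled


def select_by_variance(rows, columns, threshold=0.0):
    """Keep the columns whose population variance exceeds threshold.

    A constant column carries no information and is dropped.
    """
    return [c for c in columns
            if pvariance([r[c] for r in rows]) > threshold]


cleaned = impute_mean(LOGS, COLUMNS)
selected = select_by_variance(cleaned, COLUMNS)
print(selected)
```

In this sample the `proto` field is identical in every record, so the variance filter removes it while the other three fields are kept. A filter of this kind belongs to the statistical family of feature selection methods mentioned above; an evolutionary approach would instead search over subsets of features.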

Keywords/Search Tags:

Data cleaning, Error detection, Data repairing, Feature selection, Web log

Related items

1	Research On Key Technologies Of Temporal Data Cleaning
2	Research On Data Cleaning Method Based On Optimal Feature Selection
3	Key Techniques Of Structured Data Cleaning
4	Data Abnormity Repairing And Its Application Research
5	Effective Rule-based Algorithms For Data Cleaning
6	Design And Implementation Of The Inconsistent Data Repairing Subsystem In The Data Cleaning System
7	Research On Data Cleaning Based On Science And Technology Innovation Big Data Public Platform
8	Study Of Traffic Data Cleaning Technical System Based On RFID
9	Research And Implementation Of Health Big Data Preprocessing Methods
10	Image Data Cleaning And Feature Learning In The Presence Of Label Noise