
Research Of Key Technology In Massive Data Cleaning

Posted on: 2019-03-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F F Fan
GTID: 1368330623953337
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of information technology, the sharp growth of data volume and the continuous enrichment of data types, the idea of "data as an asset" has been widely recognized by enterprises and government agencies. As a driving force of innovation, "data" has become an important factor of production, following "land" and "capital". There is a popular motto in the data science community: "garbage in, garbage out". No matter how advanced the model or algorithm is, it is difficult to obtain highly credible rules (in data mining) or accurate models (in machine learning) from datasets with quality defects. High-quality data is therefore an important prerequisite and foundation for fully exploiting the huge hidden value of data.

Data quality is primarily used to evaluate the extent to which data meets application needs, covering consistency, completeness, uniqueness, timeliness and accuracy. As an important means of improving data quality, data cleaning faces many challenges in the big data era: (1) because of the huge data scale, traditional data cleaning algorithms with polynomial time complexity are no longer feasible, so new algorithms with linear or near-linear complexity must be developed; (2) because of the rich variety of data types, traditional rule-based data cleaning methods are no longer feasible, so more compact data representations and more efficient cleaning algorithms based on "correlations" within big data must be sought; (3) as more big data applications shift from off-line to real-time on-line scenarios, traditional algorithms can no longer meet on-line requirements, so new data cleaning algorithms must be devised for these new scenarios.

To address these principal challenges of data cleaning in big data, this dissertation starts from "uniqueness" and "completeness" and carries out research on key technologies for large-scale data cleaning:

(1) Attribute value matching in relational data: to alleviate the limitations of string-based approaches to attribute value matching, a basic model of "Value Correlation Analysis" (VCA) is proposed; to alleviate the limitation of "equality matching" in the basic model, an extended model is proposed that also takes the similarities of correlated attribute values into account; when multiple correlated attributes are available, a weighted sum is used to merge their VCA scores, and the concept of "Conditional Correlation Factor" is proposed to determine the weights; through evidential reasoning, the string similarity and the weighted sum of VCA scores are fused into a unified probability measure, which significantly improves the effectiveness of attribute value matching. Moreover, to reduce the O(N^2) cost of estimating VCA, an approach based on cosine similarity over aggregated word vectors is proposed for matching equivalent values, lowering the complexity from O(N^2) to O(N + N_T^2), where N_T ≪ N, and N_T and N denote the numbers of distinct attribute values and of tuples, respectively.
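As a rough illustration of this complexity reduction (a minimal sketch, not the dissertation's code), the following Python snippet matches equivalent attribute values by cosine similarity over aggregated word vectors, comparing only the N_T distinct values rather than all N tuples; the embedding lookup word_vec and the similarity threshold are hypothetical placeholders.

    # Illustrative sketch only: match equivalent attribute values via
    # cosine similarity over aggregated word vectors. Assumes a pre-loaded
    # mapping `word_vec` from tokens to dense vectors (e.g., pre-trained
    # embeddings); names and the threshold are hypothetical.
    import numpy as np

    def aggregate_vector(value, word_vec, dim):
        """Represent an attribute value by the mean of its tokens' word vectors."""
        vecs = [word_vec[t] for t in value.lower().split() if t in word_vec]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def match_equivalent_values(column, word_vec, threshold=0.85):
        """Pairwise cosine comparison over the N_T distinct values only,
        so the cost is O(N + N_T^2) instead of O(N^2) over all tuples."""
        dim = len(next(iter(word_vec.values())))
        distinct = sorted(set(column))                  # one O(N) pass over the tuples
        vecs = np.array([aggregate_vector(v, word_vec, dim) for v in distinct])
        norms = np.linalg.norm(vecs, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        sims = (vecs / norms) @ (vecs / norms).T        # O(N_T^2) comparisons
        return [(distinct[i], distinct[j])
                for i in range(len(distinct))
                for j in range(i + 1, len(distinct))
                if sims[i, j] >= threshold]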
(2) Automatic record matching across data sources: in record matching across data sources, the similarity distributions of matching and non-matching record pairs differ enormously; moreover, the proportion of matching pairs usually decreases as the data size grows, so matching pairs become "outliers" from a statistical point of view. Based on these insights, an automatic entity-matching algorithm based on outlier detection is proposed. Principal component analysis is employed to transform the possibly linearly correlated similarity vectors into linearly independent representations, which removes the conditional-independence assumption imposed by traditional classical models and therefore applies to broader scenarios.

(3) Relational data imputation with quality guarantees: to alleviate the limitations of neighbor-based methods, the concept of General Feature Dependency (GFD) is proposed; based on GFD, a "matching probability" is defined, and a monotonic relationship is established between the matching-probability threshold and the imputation precision; building on this monotonicity and on VC-dimension theory from statistical machine learning, a model for missing-value imputation with quality guarantees is devised, which seeks to maximize the number of imputations while keeping the achieved precision at or above the user-specified level. Moreover, based on the same quality-guarantee mechanism, an on-line missing-value imputation model (OL-MVI) is proposed to impute missing data on the fly.
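To make the threshold/precision trade-off in contribution (3) concrete, here is a minimal Python sketch (again, not the dissertation's model): a missing attribute is filled only when donor records that agree on a set of determinant attributes, standing in for a General Feature Dependency, support a single value with an estimated matching probability of at least a user-chosen threshold tau. Raising tau yields fewer imputations but higher expected precision, mirroring the monotonicity the quality-guarantee model relies on; all names are hypothetical.

    # Illustrative sketch only: threshold-based missing-value imputation.
    # Missing values are assumed to be represented as None.
    from collections import Counter

    def impute_with_threshold(records, target, determinant_attrs, tau=0.9):
        """Fill record[target] only if donors sharing the determinant-attribute
        signature support one value with probability >= tau."""
        # Index complete records by their determinant-attribute signature.
        donors = {}
        for r in records:
            if r.get(target) is not None:
                key = tuple(r.get(a) for a in determinant_attrs)
                donors.setdefault(key, Counter())[r[target]] += 1

        imputed = 0
        for r in records:
            if r.get(target) is None:
                key = tuple(r.get(a) for a in determinant_attrs)
                counts = donors.get(key)
                if not counts:
                    continue
                value, freq = counts.most_common(1)[0]
                if freq / sum(counts.values()) >= tau:   # estimated matching probability
                    r[target] = value
                    imputed += 1
        return imputed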
Keywords/Search Tags:Big Data, Data Cleaning, Equivalent Attribute Value Matching, Entity Matching, Missing Data Imputation