
Robust Machine Learning Algorithms For Data Quality Management

Posted on: 2022-05-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L D Alladoumbaye
GTID: 1488306569487784
Subject: Computer Science and Technology
Abstract/Summary:
Heterogeneous and distributed data are currently used in multifaceted applications such as information extraction, data mining, e-learning, and web applications. The quality of any decision-making related to these applications depends directly on the quality of the underlying data: without correct and reliable information, an organization will make poor decisions, so data quality management is critically important. Data quality management has grown in capability, but the demand for speed and efficiency has grown even faster. Data management specialists regard data quality as a bottleneck that repeatedly threatens data quality management and business operations, owing to proliferating data volumes and the complexity of deriving quality insights. Innovative technology such as machine learning (ML) has made large-scale advanced data analytics more tangible than before, and in various domains a transition is noticeable in data quality processes from static rule-based methods to dynamic, self-adapting, ML-based methods. ML can evaluate the quality of data resources, predict missing data, and provide cleaning recommendations, thereby reducing the complexity and effort demanded of data quality experts and scientists.

Robustness in ML relies on algorithms that learn from data and improve their performance over time. Prediction-generating models are continually updated from data outputs, allowing the system to refine itself: the learner compares the model's predictions against actual outcomes and uses that feedback to fine-tune the parameters driving those predictions. Given this importance, this thesis studies robust machine learning algorithms for data quality management, aiming to develop algorithms that can detect, sense, and learn phenomena as humans do, or even more efficiently in practical terms. Less human interaction and effective performance are the main goals of machine learning robustness. A good representation of the data is needed to design a basic algorithm, and without a robust machine learning model, producing such a simplified representation is challenging. It is therefore better to design a model that learns from the data by extracting attributes or features and transforming the data according to the model's needs, improving its efficiency and performance. This dissertation comprises five chapters.

To overcome data quality issues, this dissertation first proposes a four-adequacies (4As) Data Quality in Use model to establish indicators for data quality research. The model appropriately captures the quality-in-use levels of input data for data analytics; the adequacy levels of the Data Quality in Use model can be understood as dependability indicators for data quality investigation. The model can evaluate the level of quality in use of the data so as to produce repeatable and usable research results; leveraging international standards such as ISO/IEC 25012 and ISO/IEC 25024 is one such best practice and is very convenient. The study further builds a complete, robust conceptual and technological stack spanning raw data, processed data, data management, and analytics. Evaluating a Data Quality in Use model has gained ground because business value can only be estimated in its context of use, yet among the numerous data quality models used for regular data quality assessment, none of the robust models had been adapted to these data quality problems. This study therefore adopts and improves the 4As Data Quality in Use model to assess data quality; the model is independent of any preconditions or technologies integrated into various data quality research.

Secondly, this dissertation proposes an extended, complete formal taxonomy and a robust novel Dedupe Learning Method (DLM) that detects and corrects contextual data quality anomalies better than existing taxonomy techniques and highlights the shortcomings of the Support Vector Machine (SVM). These robust methods were designed and implemented on structured data. In summary, the proposed methods detect and correct duplication problems by estimating a similarity match and then using the strength of similarity to decide which duplicate records to merge. For robustness against duplication problems, records are joined through a combination of fuzzy matching, indexing, and blocking on fields such as names, addresses, phone numbers, and dates, with the similarity threshold set between 0.5 and 0.65.

Thirdly, this study proposes a novel modulo 9 method, evaluated in wide-ranging experiments against robust machine learning techniques such as the Support Vector Machine (SVM), Linear Regression (LR), K-Nearest Neighbors (KNN), Naïve Bayes (NB), Support Vector Classifier (SVC), Linear Support Vector Classifier (LSVC), Random Forest Classifier (RFC), Decision Tree Regressor (DTR), the Deletion Method, the Multi-Layer Perceptron (MLP), and the Mean Value. The study illustrates how missing data can affect machine learning algorithms and decision-making based on data analysis output, and proposes Modulo 9 as a novel method for handling missing data. Because the problem constraints are integers and only efficient algorithms can solve them within the allowed time, Modulo 9 also prevents integer overflow: an overflowing multiplication does not raise a run-time error or exception but silently computes and stores a bogus result once the product exceeds the available bit width. The results show that the novel method outperforms the eleven existing methods.

Fourthly, this research proposes a robust, novel Stacked Dedupe Learning entity resolution (ER) system with high accuracy and efficiency and little human interference. Sophisticated composition methods, especially Bidirectional Recurrent Neural Networks (BiRNNs) with Long Short-Term Memory (LSTM) hidden units, convert each tuple into a distributed word representation so as to capture similarities between tuples. Where pre-trained word embeddings were not available, ways to learn and tune word representations customized for ER tasks under different scenarios were considered. Moreover, a Locality-Sensitive Hashing (LSH) based blocking approach, which considers all attributes of a tuple and produces smaller blocks than traditional methods that use only a few attributes, was assessed. The algorithm was tested on multiple datasets, including standard benchmarks and multilingual data. The results show that Stacked Dedupe Learning delivers a respectable balance of efficiency and accuracy compared with existing solutions.

Finally, the dissertation provides a structured view of the challenges and methods, offers a perspective on the field, identifies research gaps and opportunities, and lays a clear foundation for further research in data quality and machine learning.
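The duplicate-detection step described in the abstract — estimating a similarity match between records within blocks and treating pairs at or above a threshold of 0.5–0.65 as duplicates — can be sketched as follows. This is a minimal illustration using Python's standard-library `difflib`, not the dissertation's actual DLM implementation; the field names, the first-letter blocking key, and the choice of 0.65 as the threshold are assumptions for the sketch.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Fuzzy string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def blocking_key(record: dict) -> str:
    """Cheap blocking: only records sharing this key are compared."""
    return record["name"][:1].lower()

def find_duplicates(records, threshold=0.65):
    """Pairwise fuzzy match within each block; pairs scoring at or
    above the threshold are flagged as likely duplicates."""
    blocks = {}
    for r in records:
        blocks.setdefault(blocking_key(r), []).append(r)
    pairs = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                score = similarity(block[i]["name"] + block[i]["address"],
                                   block[j]["name"] + block[j]["address"])
                if score >= threshold:
                    pairs.append((block[i], block[j], round(score, 2)))
    return pairs

records = [
    {"name": "Jon Smith",  "address": "12 High St"},
    {"name": "John Smith", "address": "12 High Street"},
    {"name": "Mary Jones", "address": "4 Oak Ave"},
]
dups = find_duplicates(records, threshold=0.65)
```

Blocking keeps the pairwise comparison from being quadratic in the whole dataset: "Mary Jones" is never compared against the two Smith records at all.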
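The overflow behaviour mentioned for the Modulo 9 method rests on a standard identity of modular arithmetic: reducing the operands modulo m before multiplying yields the same residue as reducing the full product, so intermediate values never exceed roughly m². Python integers are arbitrary-precision, so the silent wrap-around itself cannot be reproduced here, but the identity that makes early reduction safe can be; this is a generic illustration of the principle, not the dissertation's Modulo 9 algorithm.

```python
def mulmod(a: int, b: int, m: int = 9) -> int:
    """Multiply under modulus m, reducing each operand first so the
    intermediate product stays below m*m (safe in fixed-width integers)."""
    return ((a % m) * (b % m)) % m

a, b = 123_456_789, 987_654_321
# Identity: ((a % m) * (b % m)) % m == (a * b) % m
assert mulmod(a, b) == (a * b) % 9
```

In a fixed-width language such as C, computing `a * b` directly for operands of this size would overflow and store a bogus value without any exception, which is exactly the failure mode the abstract describes; reducing modulo 9 first keeps every intermediate within range.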
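The LSH-based blocking idea — hashing the entire tuple so that similar records probably land in the same block — can be sketched with a tiny min-hash signature. This is not the dissertation's blocking implementation; the number of hash functions, the band size, and the use of MD5 as the underlying hash are arbitrary choices for the sketch, and the collision behaviour is probabilistic by design.

```python
import hashlib

def tokens(record: dict) -> set:
    """Tokenize every attribute value: LSH blocking uses the whole tuple,
    not just one or two chosen fields."""
    return {t.lower() for v in record.values() for t in str(v).split()}

def minhash_signature(toks: set, n_hashes: int = 8) -> tuple:
    """Min-hash: for each of n seeded hash functions, keep the minimum
    hash over the token set; similar sets tend to share entries."""
    sig = []
    for seed in range(n_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in toks))
    return tuple(sig)

def block_key(record: dict, band: int = 2) -> tuple:
    """Records sharing the first `band` signature entries fall into the
    same block and are compared in detail; others are skipped."""
    return minhash_signature(tokens(record))[:band]

a = {"name": "John Smith", "address": "12 High Street"}
b = {"name": "John Smith", "address": "12 High St"}
# High token overlap makes a shared block likely, but LSH gives no guarantee.
```

Because two entries of the signature are matched rather than one whole hash of selected fields, near-duplicate tuples with small spelling differences still have a good chance of colliding, which is the advantage over traditional few-attribute blocking noted in the abstract.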
Keywords/Search Tags: data quality, Machine Learning, Missing Data, Entity Resolution, recurrent neural network (RNN), long short-term memory (LSTM), Taxonomy