
Research On Data Quality Assessment And Construction Of Assessment Platform Based On Machine Learning

Posted on: 2022-08-13; Degree: Master; Type: Thesis
Country: China; Candidate: X W Jian
GTID: 2518306332468004; Subject: Information and Communication Engineering
Abstract/Summary:
With the development of the industrial Internet, the number of terminal devices is increasing rapidly, and the volume of data transmitted and stored is growing explosively. Enterprises and institutions can carry out analysis and prediction through data mining. However, data quality is generally uneven. If the data is used directly, it can lead to misjudgment of information and to losses of money and time. An appropriate data quality assessment method is therefore needed, so that subsequent analysis, prediction and other operations rest on data of known quality.

This thesis studies the mainstream methods of data quality assessment and analyzes the advantages and key steps of assessment methods based on machine learning. Because the evaluation data set is imbalanced, an imbalanced-data classification algorithm from machine learning is used to assess quality. The classification algorithm is improved at both the data level and the algorithm level and applied to the evaluation process, in order to optimize data evaluation and build an automated data quality assessment platform. The main contributions of this thesis are as follows:

(1) To address the problem of imbalanced data sets, the data quality classification algorithm is improved at the data level and the algorithm level, and the WSMOTE-CBoost (Weighted-SMOTE-Cost-Sensitive-Boosting) algorithm is proposed. At the data level, the SMOTE algorithm is improved through distance weighting, yielding the WSMOTE algorithm. The Euclidean distance and the AdaBoost weight are combined to determine how many synthetic samples are generated from each minority-class sample, so that sampling is biased toward the class center, the class boundary and misclassified samples. At the algorithm level, the AdaBoost algorithm is improved with cost sensitivity, yielding the CBoost algorithm. Because positive and negative samples carry different misclassification costs, a cost function is introduced into the sample weight update to increase the attention paid to misclassified minority-class samples. Experiments are carried out on public data sets and on sensor data sets actually connected to the platform. The results show that the proposed algorithm improves on two unbiased metrics, F1 score and AUC.

(2) Based on the above theoretical research, the functional requirements of the data quality assessment platform are analyzed. The system management module, model management module and data interaction module are designed and implemented, and the WSMOTE-CBoost algorithm is applied in the quality assessment module. Finally, the data quality assessment platform is implemented and data quality is evaluated automatically, which further verifies the feasibility of the algorithm.
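The abstract describes WSMOTE-CBoost only at a high level, so the following is a minimal sketch of the two ideas rather than the thesis's actual algorithm. It assumes NumPy, labels in {-1, +1} with +1 as the minority class, and three hypothetical helpers: `allocate_synthetic_counts` (the scoring rule that adds the normalized Euclidean distance to the class center and the normalized boosting weight is an assumption), `smote_like_oversample` (standard SMOTE-style interpolation between a sample and one of its nearest minority neighbors), and `cost_sensitive_reweight` (one plausible way to place a misclassification cost inside an AdaBoost-style weight update).

```python
import numpy as np


def allocate_synthetic_counts(X_min, boost_w, n_new):
    """Split n_new synthetic samples across the minority points.

    Assumed scoring rule (not taken from the thesis): normalized Euclidean
    distance to the minority-class center plus normalized boosting weight.
    """
    eps = 1e-12  # guards against division by zero
    center = X_min.mean(axis=0)
    dist = np.linalg.norm(X_min - center, axis=1)
    score = dist / (dist.sum() + eps) + boost_w / (boost_w.sum() + eps)
    share = score / score.sum()
    counts = np.floor(share * n_new).astype(int)
    # Give the samples lost to rounding to the highest-scoring points.
    remainder = int(n_new - counts.sum())
    counts[np.argsort(-share)[:remainder]] += 1
    return counts


def smote_like_oversample(X_min, counts, k=3, seed=0):
    """Create counts[i] synthetic points from minority point i.

    Each synthetic point lies on the segment between point i and one of
    its k nearest minority neighbors (requires k < len(X_min)).
    """
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for i, c in enumerate(counts):
        for _ in range(c):
            j = rng.choice(neighbors[i])
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic).reshape(-1, X_min.shape[1])


def cost_sensitive_reweight(w, y_true, y_pred, alpha, cost):
    """One AdaBoost-style weight update with a per-sample cost factor.

    Misclassified samples are scaled by cost * exp(alpha); correctly
    classified samples are scaled by exp(-alpha). Weights are renormalized.
    """
    miss = y_true != y_pred
    factor = np.where(miss, cost * np.exp(alpha), np.exp(-alpha))
    w_new = w * factor
    return w_new / w_new.sum()
```

In a boosting loop under these assumptions, the weights returned by `cost_sensitive_reweight` would be fed back into `allocate_synthetic_counts` as `boost_w`, so that minority samples the previous round misclassified receive more synthetic neighbors. Setting `cost` higher for minority-class samples than for majority-class samples is what makes a minority misclassification weigh more than a majority one.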
Keywords/Search Tags:data quality assessment, imbalanced data classification, AdaBoost, SMOTE