Research On Data Cleaning And Model Evaluation Based On Data Mining

Posted on: 2018-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: J Zou
Full Text: PDF
GTID: 2348330518995471
Subject: Information and Communication Engineering
Abstract/Summary:
In the era of big data, the value of data is attracting growing attention in all walks of life. Using data cleaning methods to resolve data quality problems is a prerequisite for fully discovering the knowledge contained in data and for exploiting its value. Data quality issues include, but are not limited to, problems with the accuracy, completeness, uniqueness, timeliness, and consistency of data. Such problems make knowledge discovery harder, reduce the value of the data, distort people's judgment, lead to wrong knowledge being extracted without anyone noticing, and can cause irreparable damage to the state and to companies. In this thesis, data mining methods are used to address data cleaning from two angles, statistical methods and density-based clustering methods, with a focus on abnormal data detection and the goal of improving data quality.
The main contents of this thesis are as follows: 1. The theory of data cleaning technology in China and abroad is surveyed, the definition of data cleaning in different application scenarios is explained, and current methods and tools for data cleaning and data quality assessment are summarized. Data mining and anomaly detection methods, their application scenarios, and the general steps of data mining are also summarized, which lays the theoretical foundation for data cleaning with statistical methods and density-based clustering methods. A WLS (Weighted Least Squares) state estimation algorithm based on the Newton-Raphson power flow algorithm is proposed to estimate the voltage magnitude and voltage phase angle of a power system in steady state, and an anomaly detection criterion based on the chi-square test is proposed. Finally, the ability of the method to detect abnormal data is described.
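The chi-square check on WLS residuals can be illustrated with a deliberately simplified sketch. It is not the thesis's Newton-Raphson power-flow estimator: it assumes a single state (one voltage magnitude) measured directly by several meters, so the WLS estimate reduces to a weighted mean. The function names, the meter readings, and the 95% confidence level are illustrative assumptions, and the critical value uses the Wilson-Hilferty approximation rather than an exact chi-square quantile.

```python
from statistics import NormalDist


def chi2_critical(dof, confidence=0.95):
    """Approximate chi-square critical value (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(confidence)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3


def wls_scalar(measurements, sigmas):
    """WLS estimate of one state that every measurement observes directly.

    Weights are 1 / sigma^2. Returns the estimate and the weighted sum of
    squared residuals J, which is the statistic used in the chi-square test.
    """
    weights = [1.0 / s ** 2 for s in sigmas]
    x_hat = sum(w * z for w, z in zip(weights, measurements)) / sum(weights)
    j_stat = sum(w * (z - x_hat) ** 2 for w, z in zip(weights, measurements))
    return x_hat, j_stat


def has_bad_data(measurements, sigmas, confidence=0.95):
    """Flag bad data when J exceeds the chi-square threshold.

    Degrees of freedom = number of measurements minus number of states (1 here).
    """
    x_hat, j_stat = wls_scalar(measurements, sigmas)
    dof = len(measurements) - 1
    return j_stat > chi2_critical(dof, confidence), x_hat, j_stat


# Hypothetical per-unit voltage readings from five meters (toy data).
sigmas = [0.01] * 5
clean = [1.00, 1.01, 0.99, 1.00, 1.005]
corrupted = [1.00, 1.01, 0.99, 1.00, 1.20]  # last meter reports a gross error

print(has_bad_data(clean, sigmas)[0])
print(has_bad_data(corrupted, sigmas)[0])
```

With a nonlinear measurement model, the same test would be applied to the residuals obtained after the iterative WLS solution converges; only the way the estimate is computed changes.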
The proposed framework includes four parts: missing value processing, feature selection, density feature extraction, and anomaly detection. It can clean general data, especially unlabeled multidimensional data, and return clustering results. The performance of the DBSCAN algorithm, the LOF algorithm, and the traditional algorithm is evaluated on a real case of GPS trajectory data cleaning. The effectiveness and efficiency of the data cleaning methods are evaluated with the precision and recall metrics.
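As a rough illustration of density-based anomaly detection scored with precision and recall, the sketch below implements a textbook DBSCAN in plain Python and treats its noise points as anomalies. It does not reproduce the thesis's framework or its LOF comparison; the `eps` and `min_pts` values, the toy coordinates, and the ground-truth labels are all assumptions made for the example.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN: returns a cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


def precision_recall(predicted, actual):
    """Precision and recall of predicted anomaly indices against ground truth."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall


# Toy 2-D track: five points along a line plus one far-off point (index 5).
track = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (2, 10)]
labels = dbscan(track, eps=1.5, min_pts=3)
anomalies = [i for i, label in enumerate(labels) if label == NOISE]
print(anomalies, precision_recall(anomalies, actual=[5]))
```

On real GPS trajectories, `eps` would need to be expressed in a meaningful distance unit and tuned together with `min_pts`, since both strongly affect which points are reported as noise.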
Keywords/Search Tags: data cleaning, data mining, anomaly detection, statistics, cluster