Research On Data Cleaning And Model Evaluation Based On Data Mining

Posted on:2018-09-24

Degree:Master

Type:Thesis

Country:China

Candidate:J Zou

GTID:2348330518995471

Subject:Information and Communication Engineering

Abstract/Summary:

In the era of big data, the value of data is attracting growing attention across all sectors. Using data cleaning methods to resolve data quality problems is a prerequisite for fully discovering the knowledge in data and realizing its value. Data quality issues include, but are not limited to, problems with the accuracy, completeness, uniqueness, timeliness, and consistency of data. Such problems make knowledge harder to discover, reduce the value of data, distort people's judgment, lead to wrong conclusions being drawn unknowingly, and can cause irreparable damage to the state and to companies. This thesis uses data mining methods to address data cleaning from two angles, statistical methods and density-based clustering methods, focusing on the detection of abnormal data with the goal of improving data quality. The main contents of this thesis are as follows: 1. The theory of data cleaning technology at home and abroad is surveyed, the definition of data cleaning in different application scenarios is explained, and current methods and tools for data cleaning and data quality assessment are summarized. Data mining and anomaly detection methods, their application scenarios, and the general steps of data mining are also summarized, which lays the theoretical foundation for data cleaning using statistical methods and density-based clustering methods. A WLS (Weighted Least Squares) state estimation algorithm based on the Newton-Raphson power flow algorithm is proposed to estimate the voltage magnitude and voltage phase angle of a power system in steady state, and an anomaly detection equation based on the chi-square test is proposed. Finally, the ability of the method to detect abnormal data is described.
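The idea of pairing WLS state estimation with a chi-square test can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a linear measurement model z = Hx + e in place of the nonlinear Newton-Raphson power flow equations, and the matrix `H`, the standard deviations `SIGMAS`, the sample measurements, and all function names are hypothetical. The weighted sum of squared residuals is compared with the 95% chi-square critical value for m - n degrees of freedom (m measurements, n states).

```python
# Illustrative sketch only: a linear model z = H x + e stands in for the
# nonlinear power flow equations (an assumption, not the thesis's model).

# 95% chi-square critical values for 1..4 degrees of freedom (standard table).
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def wls_estimate(h, z, sigmas):
    """Return the x that minimizes sum(((z - H x) / sigma) ** 2)."""
    n, m = len(h[0]), len(z)
    w = [1.0 / s ** 2 for s in sigmas]
    # Normal equations: (H^T W H) x = H^T W z
    gain = [[sum(w[k] * h[k][i] * h[k][j] for k in range(m))
             for j in range(n)] for i in range(n)]
    rhs = [sum(w[k] * h[k][i] * z[k] for k in range(m)) for i in range(n)]
    return solve_linear(gain, rhs)


def chi_square_bad_data(h, z, sigmas):
    """Return (flag, J, x); flag is True when J exceeds the 95% threshold."""
    x = wls_estimate(h, z, sigmas)
    residuals = [z[k] - sum(h[k][i] * x[i] for i in range(len(x)))
                 for k in range(len(z))]
    j = sum((r / s) ** 2 for r, s in zip(residuals, sigmas))
    dof = len(z) - len(x)  # redundancy: measurements minus states
    return j > CHI2_95[dof], j, x


# Hypothetical example: 2 states, 4 measurements, so 2 degrees of freedom.
H = [[1, 0], [0, 1], [1, 1], [1, -1]]
SIGMAS = [0.01, 0.01, 0.01, 0.01]
clean = [1.0, 0.5, 1.5, 0.5]        # consistent with x = [1.0, 0.5]
corrupted = [1.0, 0.5, 1.8, 0.5]    # third measurement has a gross error

print(chi_square_bad_data(H, clean, SIGMAS)[0])      # False
print(chi_square_bad_data(H, corrupted, SIGMAS)[0])  # True
```

In this sketch the test only signals that some measurement is abnormal; locating which one would need a further step, such as inspecting normalized residuals.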
The proposed framework includes four parts: missing value processing, feature selection, density feature extraction, and anomaly detection. It can refine general data, especially unlabeled multidimensional data, and returns clustering results. The performance of the DBSCAN algorithm, the LOF algorithm, and a traditional algorithm is evaluated on a real case of GPS trajectory data cleaning. The effectiveness and efficiency of the data cleaning method are evaluated with the precision and recall metrics.
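As a rough illustration of density-based anomaly detection scored with precision and recall, the sketch below runs a plain DBSCAN over a handful of 2D points and treats noise points as anomalies. It is a minimal sketch under stated assumptions, not the thesis's framework: the points, the `eps` and `min_pts` values, the ground-truth labels, and the function names are all hypothetical, and the missing value, feature selection, and LOF steps are omitted.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return a cluster label per point; NOISE marks density outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


def precision_recall(predicted, actual):
    """Precision and recall of predicted anomaly indices against the truth."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical trajectory: seven nearby fixes and one far-off point (index 5).
track = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (0.3, 0.1),
         (0.4, 0.2), (5.0, 5.0), (0.5, 0.2), (0.6, 0.3)]
true_anomalies = {5}

labels = dbscan(track, eps=0.3, min_pts=3)
predicted = {i for i, label in enumerate(labels) if label == NOISE}
print(predicted)                                    # {5}
print(precision_recall(predicted, true_anomalies))  # (1.0, 1.0)
```

The choice of `eps` and `min_pts` drives the result: a smaller `eps` would start flagging legitimate points at the ends of the track, which lowers precision.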

Keywords/Search Tags:

data cleaning, data mining, anomaly detection, statistics, cluster

Related items

1	Research And Implement Of Enterprise Lan Traffic Anomaly Detection Based On Data Mining
2	Research On Anomaly Intrusion Detection Of Web Application Based On Data Mining
3	Research On Key Technologies Of Temporal Data Cleaning
4	Research Of Anomaly Detection System Of Network Traffic Based On Data Mining
5	Research Of Anomaly Detection System Of Network Traffic Based On Data Mining
6	Research On Data Flow Anomaly Detection Algorithm Cluster-based
7	Research On Key Techniques Of Anomaly Detection For Big Data Platform
8	Intrusion Detection, Data Mining-based Adaptive
9	A Research On Trajectory Data Cleaning Based On Temporal Feature
10	The Research Of Network Anomaly Detection Based On Data Mining