The Design And Implementation Of Data Analysis System For Data Cleaning

Posted on: 2022-06-29  Degree: Master  Type: Thesis
Country: China  Candidate: Z Zhao
GTID: 2518306725484974  Subject: Master of Engineering
Abstract/Summary:
With the development of the Internet, data has grown explosively. Although the scale of the data is large, its quality cannot be guaranteed, and much of it is messy and needs processing before use. At the same time, machine learning is becoming increasingly important in a growing number of fields. Good training results depend not only on the training model but also on the quality of the training data. Data cleaning is therefore an important step in any data analysis. Ideally, we would inspect every variable in a dataset to find potential errors, but this process can be very time-consuming, costly, and error-prone.
In this thesis, a data analysis system for data cleaning is designed to reduce the time that data cleaning personnel spend on the cleaning process and to improve their efficiency. Through this system, users upload the datasets that need to be cleaned, and the platform automatically generates a quality assessment report that analyzes potential problems in each dataset. This thesis designs and implements five statistical indicators and nine inspection strategies. They are domain-independent, so the system can handle datasets from any field. Based on these indicators and strategies, the dataset is analyzed and a quality assessment report is generated. The report is an automatically generated, non-technical overview document, so the system promotes dialogue between data analysts and domain experts. Because the process is documented, data cleaning changes from the usual ad hoc practice into systematic, well-documented work. Based on the quality assessment report, the system also provides data cleaning functions for handling missing data and outliers.
The system is divided into four modules: the authority management module, the quality assessment report generation module, the document management module, and the data cleaning module. The authority management module handles user authentication and user permissions to ensure the security of the service. The quality assessment report generation module parses the dataset and generates a quality assessment report for users to download and consult. The document management module manages three types of documents (datasets uploaded by users, evaluation reports generated by the system, and successfully cleaned datasets) and also maintains the metadata of each dataset. The data cleaning module cleans missing values and abnormal values in the dataset. The four modules complement each other and together form a data analysis system for data cleaning. The system is presented to the user as a web application with separate front end and back end. The front end uses the Vue.js framework. The back end is built with the Spring Boot framework, uses the MyBatis framework for the data persistence layer, and uses MySQL, Redis, and OSS for data storage. The system has been put into use and has improved the efficiency of data cleaning for data analysts.
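The abstract does not name the five statistical indicators or the nine inspection strategies, so the following is only an illustrative sketch of the kind of checks such a report and cleaning step might perform. It assumes a missing-value ratio as an indicator, the common 1.5 × IQR rule as an outlier check, and mean imputation as a cleaning action. All function names are hypothetical, and Python is used here for brevity even though the described back end is built on Spring Boot.

```python
from statistics import mean, quantiles


def is_missing(value):
    """Treat None and empty strings as missing (an assumption of this sketch)."""
    return value is None or value == ""


def missing_ratio(values):
    """Fraction of entries in a column that are missing."""
    if not values:
        return 0.0
    return sum(1 for v in values if is_missing(v)) / len(values)


def iqr_outliers(values, k=1.5):
    """Return numeric values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    nums = [v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if len(nums) < 4:
        return []  # too few points to estimate quartiles meaningfully
    q1, _, q3 = quantiles(nums, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in nums if v < low or v > high]


def impute_mean(values):
    """Replace missing entries with the mean of the present entries."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return list(values)  # nothing to impute from
    fill = mean(present)
    return [fill if is_missing(v) else v for v in values]


# Hypothetical column with two missing entries and one extreme value.
column = [10, 12, None, 11, 13, 12, 300, ""]
report = {
    "missing_ratio": missing_ratio(column),
    "outliers": iqr_outliers(column),
}
```

In a report of this kind, `report` would be rendered into the non-technical overview, and a function such as `impute_mean` would only run after the analyst has reviewed the flagged values.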
Keywords/Search Tags: Data Cleaning, Data Analysis, Quality Assessment Report