Data Quality Assessment and Improvement: Methods and Application

Posted on: 2016-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: J Z Tang
Full Text: PDF
GTID: 2348330503494360
Subject: Business Administration
Abstract/Summary:
Errors, missing values, and other defects inevitably arise while data is produced, stored, and transmitted. Data exists because it has value: some data serves as evidence, and other data supports analysis and forecasting. Whatever its purpose, data that contains errors, gaps, or other defects loses value and can sometimes cause heavy losses, so ensuring high data quality is the foundation of using data effectively. In recent years the volume of data has grown explosively, and interest in big data has grown with it. Most people look for better methods of analyzing big data, but few pay real attention to data quality, so there are few papers on the subject.

In this thesis, I review the literature on data quality and summarize the research achievements in this area. My research focuses on methods for evaluating and improving data quality. I apply clustering and classification to data quality evaluation, introduce and compare several commonly used clustering and classification methods, and explore whether they are feasible for this purpose. Because any measurement result contains some deviation, I apply Gage R&R and analysis of variance to evaluate the feasibility of evaluation methods such as clustering and classification. I also briefly introduce improvement methods matched to different kinds of data quality problems.

I take consumer complaints data as an example and choose K-means clustering to evaluate its quality. The number of clusters is set equal to the number of manually assigned categories, and each cluster is named after its corresponding manual category. I then establish an assessment matrix, compute assessment values using Van Rijsbergen's F1 value (F-measure), and analyze these values with measurement system analysis. The results indicate that cluster analysis is a feasible method for assessing data quality.
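The assessment step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the clusters have already been named after their corresponding manual categories, and the function names (`assessment_matrix`, `f1_per_category`) and the toy complaint labels are hypothetical. It builds the matrix of manual category against cluster and derives precision, recall, and F1 for each category.

```python
from collections import Counter


def assessment_matrix(manual, clustered):
    """Count records for each (manual category, cluster name) pair."""
    return Counter(zip(manual, clustered))


def f1_per_category(manual, clustered):
    """Return {category: (precision, recall, F1)}.

    Assumes each cluster carries the name of the manual category it
    corresponds to, so a record counts as a hit when both labels agree.
    """
    matrix = assessment_matrix(manual, clustered)
    manual_totals = Counter(manual)
    cluster_totals = Counter(clustered)
    scores = {}
    for category in sorted(manual_totals):
        hits = matrix[(category, category)]
        cluster_size = cluster_totals[category]
        precision = hits / cluster_size if cluster_size else 0.0
        recall = hits / manual_totals[category]
        # F1 is the harmonic mean of precision and recall (0 when no hits).
        f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
        scores[category] = (precision, recall, f1)
    return scores


# Hypothetical toy labels, not data from the thesis.
manual = ["billing", "billing", "billing", "delivery", "delivery", "service"]
clustered = ["billing", "billing", "delivery", "delivery", "delivery", "service"]

scores = f1_per_category(manual, clustered)
for category, (p, r, f1) in scores.items():
    print(f"{category}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Per-category F1 values like these are the kind of assessment values that could then be passed to a Gage R&R or analysis-of-variance study; that step is not shown here.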
In terms of data quality improvement, I describe in detail how to improve unstandardized data, adopting a keyword matching method to standardize it. After data exploration and quality assessment, I put forward suggestions on data quality management covering three stages: database design, data production, and data post-processing.
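As a rough illustration of keyword matching for standardization, the sketch below maps free-text entries to a standard category by counting keyword occurrences. The keyword table, the `standardize` function, and the `"unclassified"` fallback are all assumptions made for this example; the abstract does not specify how the matching is actually performed.

```python
# Hypothetical keyword table: standard category -> keywords that indicate it.
KEYWORDS = {
    "billing": ["invoice", "overcharge", "refund"],
    "delivery": ["late", "shipping", "courier"],
}


def standardize(text, keywords=KEYWORDS, default="unclassified"):
    """Return the standard category whose keywords appear most often in text.

    Matching is a case-insensitive substring test, so a short keyword such
    as "late" would also match inside a longer word such as "plate".
    Ties keep the category listed first; no match returns the default.
    """
    lowered = text.lower()
    best, best_hits = default, 0
    for category, words in keywords.items():
        hits = sum(1 for word in words if word in lowered)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


print(standardize("Refund requested for overcharge on my invoice"))
print(standardize("The courier was late again"))
print(standardize("General feedback"))
```

A production version would likely need word-boundary matching and a curated keyword list per category; the substring test is kept here only for brevity.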
Keywords/Search Tags: data quality management, K-means clustering, Gage R&R, complaints data