Research on Data Quality and Cleaning Evaluation Technology in Network Audit

Posted on: 2017-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Zhou
Full Text: PDF
GTID: 2348330518470809
Subject: Computer Science and Technology
Abstract/Summary:
Auditing has moved from traditional manual work to computer-based audit, producing ever larger volumes of data; yet the data alone do not yield information, so audits are often data-rich but knowledge-poor. Data quality determines how useful these data are: only data of good quality can support correct decisions and credible conclusions. Evaluating data quality and then cleaning the data is the usual way to improve it. This thesis studies methods of data quality assessment and data cleaning in the audit field.

The thesis first examines the principles of data cleaning and the methods for cleaning different kinds of dirty data. Audit data have a distinctive characteristic: abnormal records may reflect genuinely abnormal phenomena in the audited business, so for audit purposes, the more effective abnormal data a data set retains, the higher its quality. In network auditing, data quality can therefore be assessed in terms of the data's potential for online audit search. Building on this characteristic, the thesis proposes a method for evaluating the audit potential of data in the audit field.

For data cleaning, the common field-matching algorithms, namely Levenshtein (edit) distance, Smith-Waterman distance, and Hamming distance, are presented and analyzed in detail. Algorithms based on the "sort and merge" idea, namely the basic sorted-neighborhood method, the multi-pass sorted-neighborhood method, and the priority-queue algorithm, are studied, and an algorithm for detecting approximately duplicated records based on locality-sensitive hashing (LSH) is proposed and compared against the sort-and-merge family. The sort-and-merge algorithms are sensitive to the choice of sort key: different sort keys can produce different clusterings, whereas the LSH-based algorithm is insensitive to key order. Moreover, because duplicate records are relatively rare, sort-and-merge algorithms compare many records that are not duplicates at all, while the LSH-based algorithm sharply reduces comparisons between dissimilar records. Experimental results show that the LSH-based duplicate detection algorithm needs roughly an order of magnitude fewer record comparisons than the traditional algorithms, although its precision and recall are slightly lower.
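To make the field-matching step concrete, here is a minimal Python sketch of Levenshtein edit distance, one of the three distances the abstract names. The function name and the single-row dynamic-programming layout are illustrative choices, not code from the thesis.

    def levenshtein(a, b):
        # d[j] holds the edit distance from a[:i] to b[:j] for the current row i.
        m, n = len(a), len(b)
        d = list(range(n + 1))
        for i in range(1, m + 1):
            prev, d[0] = d[0], i
            for j in range(1, n + 1):
                cur = d[j]
                d[j] = min(d[j] + 1,                      # delete a[i-1]
                           d[j - 1] + 1,                  # insert b[j-1]
                           prev + (a[i - 1] != b[j - 1])) # substitute (or match)
                prev = cur
        return d[n]

    # e.g. a misspelled field value in a dirty record:
    assert levenshtein("receivable", "recievable") == 2

Field matching typically flags two values as the same entity when this distance, normalized by field length, falls below a chosen threshold.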
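The "sort and merge" family the abstract studies can likewise be sketched briefly. The following is a generic basic sorted-neighborhood method, assuming caller-supplied key and similar functions (both hypothetical placeholders here): records are sorted on a key, then each record is compared only with its neighbors inside a fixed-size sliding window.

    def sorted_neighborhood(records, key, similar, window=5):
        # Sort on the chosen key; only records that land near each other
        # in sorted order are ever compared, so the key choice matters.
        recs = sorted(records, key=key)
        pairs = []
        for i, r in enumerate(recs):
            for j in range(i + 1, min(i + window, len(recs))):
                if similar(r, recs[j]):
                    pairs.append((r, recs[j]))
        return pairs

This sketch also shows why the method is key-sensitive, as the abstract notes: two duplicates whose sort keys differ (say, a typo in the leading characters) can end up far apart after sorting and never fall into the same window; the multi-pass variant mitigates this by repeating the scan with several different keys.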
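Finally, a minimal sketch of duplicate-candidate generation with locality-sensitive hashing, using the standard MinHash-plus-banding construction. This is a generic illustration of the technique, not the thesis's specific algorithm; the function names and parameters are assumptions, tokenize is a caller-supplied tokenizer, and Python's built-in hash (process-salted for strings) stands in for a proper hash family.

    from collections import defaultdict

    def minhash_signature(tokens, seeds):
        # One MinHash value per seeded hash function; assumes a non-empty token set.
        return [min(hash((s, t)) for t in tokens) for s in seeds]

    def lsh_candidates(records, tokenize, n_hashes=20, bands=5):
        # Split each signature into bands; records sharing any band bucket
        # become candidate duplicates, so most dissimilar pairs are never compared.
        seeds = range(n_hashes)
        rows = n_hashes // bands
        buckets = defaultdict(list)
        for idx, rec in enumerate(records):
            sig = minhash_signature(tokenize(rec), seeds)
            for b in range(bands):
                buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)
        candidates = set()
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add((ids[i], ids[j]))
        return candidates

    records = ["John Smith, 42 Main St", "Jon Smith, 42 Main St.", "Alice Wu, 9 Elm Rd"]
    print(lsh_candidates(records, tokenize=lambda r: set(r.lower().split())))

Because the signature depends only on the set of tokens, not on any sort order, this construction is insensitive to key order, which is the property the abstract contrasts against the sort-and-merge algorithms; the trade-off is that banding is probabilistic, consistent with the slightly lower precision and recall reported.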
Keywords/Search Tags: Audit, Data quality, Data cleaning