
Research on Key Issues of Data Quality Management, Assessment and Detection in the Big Data Environment

Posted on: 2020-04-02  Degree: Doctor  Type: Dissertation
Country: China  Candidate: H Liu  Full Text: PDF
GTID: 1368330575978766  Subject: Computer system architecture
Abstract/Summary:
With the continuous development of the information society, information systems are filled with massive, multi-structured and multi-dimensional data resources, and the value of big data has been widely recognized by society. How to mine the value of data has become a central concern in many research and application fields. Whether data turns out to be garbage or treasure depends above all on whether the data to be analyzed and mined is of high quality: a low-quality data source not only fails to reveal the value of the data, it may contradict the actual situation and produce harmful side effects. Research institutes and scholars at home and abroad have proposed many methodologies and frameworks for data quality management and detection, but the lack of concrete implementation means in practical applications makes data quality management difficult to carry out. Focusing on the key problems of data quality management, assessment and detection, this paper carries out the following work.

(1) For data quality management, this paper makes a thorough comparison and analysis of the current mainstream data management methods and frameworks at home and abroad, and distills a general process and indicator system for data quality management. Six measurement methods and calculation formulas for important data quality indicators are proposed, providing effective guidance for data quality management and assessment; an illustrative indicator calculation is sketched in the first code example below. In addition, a data quality maturity model is proposed to support the implementation of data quality management and to serve as a reference for the overall evaluation of data quality.

(2) For data preprocessing, a data discretization algorithm is proposed. In the big data environment, data is generated and updated at an ever-increasing rate, and much of it enters information systems as continuous values that must be discretized before they can be processed; the efficiency and quality of discretization are therefore vital to subsequent data quality detection and assessment. This paper proposes an efficient and accurate discretization algorithm, ICACC (Improved Class-Attribute Contingency Coefficient), which improves both the efficiency and the accuracy of converting continuous data into discrete data in big data applications. Experimental results show that the accuracy of the algorithm is about 10% higher than that of traditional discretization algorithms; a simplified sketch of the underlying contingency-coefficient criterion is given below.

(3) For data quality detection, difference detection and integrity detection are the two most important aspects, and two detection methods are proposed. In difference detection, outlier detection is an important research topic, significant both for identifying and filtering outliers and for applications of the outliers themselves. Traditional outlier detection requires data analysts and engineers to identify outliers based on experience or existing business rules, which is not only time-consuming but also inaccurate, and greatly restricts information systems. This paper therefore proposes a data quality difference detection method, M-SPC (Machine-Statistics Process Control), which combines deep learning with statistical process control: a neural network model and control charts are used together to detect outlier data (see the control-chart sketch below). Experimental data confirm that the method is effective.
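As a point of reference for contribution (1): the abstract does not give the six indicator formulas themselves, so the sketch below simply shows what a calculation for two commonly used indicators (completeness and uniqueness) can look like; the definitions, field names and sample records are illustrative assumptions, not the dissertation's formulas.

```python
# Illustrative data quality indicators: completeness and uniqueness.
# NOTE: common textbook definitions used as stand-ins; the dissertation's
# own six indicator formulas are not given in the abstract.

def completeness(records, fields):
    """Share of non-missing cells over all required cells."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields
                 if r.get(f) not in (None, ""))
    return filled / total

def uniqueness(records, key_fields):
    """Share of distinct key-field combinations among all records."""
    if not records:
        return 1.0
    keys = [tuple(r.get(f) for f in key_fields) for r in records]
    return len(set(keys)) / len(keys)

if __name__ == "__main__":
    data = [
        {"id": 1, "name": "alice", "age": 30},
        {"id": 2, "name": "", "age": None},
        {"id": 2, "name": "bob", "age": 41},
    ]
    print(completeness(data, ["id", "name", "age"]))  # 7/9 ~= 0.78
    print(uniqueness(data, ["id"]))                   # 2/3 ~= 0.67
```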
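For contribution (2), the sketch below implements a plain class-attribute contingency coefficient (CACC) criterion with a greedy top-down boundary search, i.e. the baseline family of methods that an improved CACC algorithm builds on. The specific improvements that define ICACC are not described in the abstract, so this only illustrates the criterion, not the dissertation's algorithm.

```python
import math
from collections import Counter

def cacc_value(values, labels, cuts):
    """CACC statistic for a sorted list of cut points on one attribute."""
    n_intervals = len(cuts) + 1
    if n_intervals < 2:
        return 0.0
    m = len(values)

    def interval_of(v):                       # index of the interval holding v
        for idx, c in enumerate(cuts):
            if v <= c:
                return idx
        return len(cuts)

    quanta = Counter((lab, interval_of(v)) for v, lab in zip(values, labels))
    class_tot = Counter(labels)                             # row totals
    inter_tot = Counter(interval_of(v) for v in values)     # column totals
    s = sum(q * q / (class_tot[c] * inter_tot[r]) for (c, r), q in quanta.items())
    y = m * (s - 1.0) / math.log(n_intervals)               # penalize many intervals
    return math.sqrt(y / (y + m))

def discretize(values, labels):
    """Greedy top-down search: keep adding the boundary that most improves CACC."""
    sv = sorted(values)
    candidates = sorted({(a + b) / 2 for a, b in zip(sv, sv[1:]) if a != b})
    cuts, best, improved = [], 0.0, True
    while improved and candidates:
        improved = False
        for c in candidates:
            score = cacc_value(values, labels, sorted(cuts + [c]))
            if score > best:
                best, best_cut, improved = score, c, True
        if improved:
            cuts = sorted(cuts + [best_cut])
            candidates.remove(best_cut)
    return cuts

vals = [1.0, 1.2, 1.3, 5.0, 5.2, 5.5]
labs = ["a", "a", "a", "b", "b", "b"]
print(discretize(vals, labs))   # a single cut near the class boundary: [3.15]
```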
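For the M-SPC method of contribution (3), the abstract only states that a neural network is combined with statistical process control. The sketch below shows one common way such a combination can be arranged: a small regression network models the expected value of each record, and Shewhart 3-sigma control limits on its residuals flag outliers. The scikit-learn MLP, the synthetic data and the 3-sigma rule are all assumptions standing in for details the abstract does not give.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Stand-in for the neural-network stage: learn the normal relationship
# between the feature columns and the monitored value.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.5, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=500)
y[::50] += 5.0                                   # inject a few anomalous records

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X, y)

# Stand-in for the statistical-process-control stage: Shewhart 3-sigma
# control limits on the model residuals.
residuals = y - model.predict(X)
center, sigma = residuals.mean(), residuals.std()
ucl, lcl = center + 3 * sigma, center - 3 * sigma

outliers = np.where((residuals > ucl) | (residuals < lcl))[0]
print("records flagged as outliers:", outliers)
```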
For data integrity detection, this paper designs an adaptive integrity detection method based on a randomized algorithm and the MD5 hash algorithm, which comprehensively takes into account data importance, network load, transmission duration and fault conditions. Experiments show that the method can effectively verify the integrity of data during transmission and improve the application value of the data; a minimal checksum-verification sketch is given at the end of this abstract.

(4) For data quality assessment, data validity assessment is currently the aspect of greatest concern. Finding the usable data within huge and complex data sets is critical both to the efficiency of the processing system and to the application value of the data, and machine learning algorithms and data processing methods are well suited to this kind of problem. To address it, this paper proposes a data validity evaluation algorithm, MKS (MST K-means Slope One), which improves the validity of the original data in practical applications by adding time weights; the approach is verified by experiments, and a time-weighted Slope One sketch also appears at the end of this abstract.

Finally, data quality management and detection should be distinguished from QoS (Quality of Service) detection. QoS, as described in RFC 3644, refers to the use of various underlying technologies to provide better service capability for network communication and applications, and it is a network mechanism aimed at alleviating network delay and congestion. In this paper, data quality management and detection refer specifically to the overall framework of data quality and to the methodology and implementation of its process and evaluation dimensions.
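For the integrity detection part of contribution (3), the sketch below shows the minimal mechanics of MD5-based block verification. The adaptive behaviour described above (weighing data importance, network load, transmission duration and fault conditions) is collapsed into a single hypothetical check_ratio parameter; this is only an illustration of the checksum step, not the dissertation's method.

```python
import hashlib
import random

def block_digests(blocks):
    """MD5 digest per block, computed on the sending side."""
    return [hashlib.md5(b).hexdigest() for b in blocks]

def verify(blocks, digests, check_ratio=1.0, seed=None):
    """Re-hash a random sample of received blocks and compare digests.

    check_ratio is a hypothetical knob standing in for the adaptive
    decision of how much of the data to verify.
    """
    rng = random.Random(seed)
    n_check = max(1, int(len(blocks) * check_ratio))
    sample = rng.sample(range(len(blocks)), n_check)
    return [i for i in sample
            if hashlib.md5(blocks[i]).hexdigest() != digests[i]]

sent = [b"chunk-0", b"chunk-1", b"chunk-2", b"chunk-3"]
digests = block_digests(sent)
received = list(sent)
received[2] = b"chunk-2-corrupted"                 # simulate transmission damage
print(verify(received, digests, check_ratio=1.0))  # -> [2]
```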
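For the MKS algorithm of contribution (4), only the time-weighted Slope One component is sketched here; the MST and K-means stages and the dissertation's actual weighting formula are not given in the abstract, so the exponential decay below is an illustrative assumption.

```python
import math
from collections import defaultdict

def time_weight(age_days, lam=0.01):
    """Assumed time weight: older records count for less."""
    return math.exp(-lam * age_days)

def slope_one_deviations(ratings, lam=0.01):
    """ratings: {user: {item: (score, age_days)}} -> time-weighted deviations."""
    dev, wsum = defaultdict(float), defaultdict(float)
    for user_ratings in ratings.values():
        items = list(user_ratings.items())
        for j, (rj, aj) in items:
            for i, (ri, ai) in items:
                if i == j:
                    continue
                w = time_weight(aj, lam) * time_weight(ai, lam)
                dev[(j, i)] += w * (rj - ri)
                wsum[(j, i)] += w
    return {p: dev[p] / wsum[p] for p in dev}, wsum

def predict(user_ratings, item, dev, wsum, lam=0.01):
    """Predict a score for `item` from the user's other (score, age) pairs."""
    num = den = 0.0
    for i, (ri, ai) in user_ratings.items():
        if (item, i) in dev:
            w = wsum[(item, i)] * time_weight(ai, lam)
            num += w * (dev[(item, i)] + ri)
            den += w
    return num / den if den else None

data = {"u1": {"a": (4, 1), "b": (3, 10)},
        "u2": {"a": (5, 2), "b": (4, 3), "c": (2, 1)}}
dev, wsum = slope_one_deviations(data)
print(predict(data["u1"], "c", dev, wsum))   # prediction for item "c"
```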
Keywords: Big Data, data quality management, maturity model, discretization, neural network, outlier detection, statistical process control