
Verification Method Of Data Quality In Shaanxi Science And Technology Cloud

Posted on: 2018-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: L J Lei
GTID: 2348330521951168
Subject: Engineering
Abstract/Summary:
The era of big data presents new challenges for enterprise storage, data management, and data mining, and data quality is a prerequisite for analyzing and using big data effectively. The Shaanxi Provincial Department of Science and Technology gathers, through collection and reporting, a large volume of scientific and technological information resources from decentralized nodes that span regions and sectors. This data suffers from serious quality problems, including incomplete, inconsistent, and abnormal records. The causes are the lack of a unified data format across node departments, human errors that produce duplicate entries or incorrect registrations, and unknown, omitted, or missing values in the technical information data.

Based on the "Science and Technology Cloud" project of the Shaanxi Science and Technology Resource Co-ordination Center, this paper analyzes and summarizes the data quality problems in the project and verifies two major ones: data redundancy caused by abbreviated organization names, and partially missing values in scientific and technological information.

First, this paper introduces the research background and significance of data quality and explains the basic concepts of data quality and the key data preprocessing technologies of the "Science and Technology Cloud" platform. It then builds the data quality assessment dimensions for the "Science and Technology Cloud" and uses them to formulate data quality assessment criteria for scientific and technical information data. Next, it studies the verification of simple duplicate data, covering both completely duplicated records and similar (approximately duplicated) records. On this basis, it proposes a method for verifying the data redundancy caused by abbreviated organization names in scientific and technical information data.

In technical information data, missing values make data mining results inaccurate, and missing key attributes reduce the number of usable samples; both affect the results of data analysis. To solve the missing data problem, this paper proposes filling missing values using nearest-neighbor interpolation and association rules. The methods above are designed, implemented, and applied in the "Science and Technology Cloud" project. Experiments on 15,643 records extracted from the scientific and technical talent pool and the scientific literature library verify the effectiveness and feasibility of the data redundancy verification and missing-value filling methods in the "Science and Technology Cloud".
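
The two checks named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the matching rule or the distance measure, so the initialism test, the stopword list, the Hamming distance over categorical fields, and all names (`same_organization`, `knn_fill`, the sample records) are assumptions made for illustration. Association-rule filling is not covered here.

```python
from collections import Counter

# Assumed list of filler words ignored when building an initialism.
STOPWORDS = {"of", "and", "the", "for"}


def normalize(name):
    """Lowercase a name, drop punctuation, and split it into words."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return kept.split()


def initials(name):
    """Build an initialism from the non-stopword words of a name."""
    return "".join(w[0] for w in normalize(name) if w not in STOPWORDS)


def same_organization(a, b):
    """Guess whether two names denote one organization.

    True if the normalized names are equal, or if one name is the
    initialism (at least two letters) of the other. This is a heuristic
    and can produce false matches.
    """
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return True
    ja, jb = "".join(na), "".join(nb)
    return (len(ja) >= 2 and ja == initials(b)) or (len(jb) >= 2 and jb == initials(a))


def knn_fill(records, target, k=3):
    """Fill missing (None) values of `target` by nearest-neighbor voting.

    Distance is the share of differing values over the fields both
    records have (Hamming distance on categorical data, an assumption).
    Returns new records; the input list is not modified.
    """
    complete = [r for r in records if r.get(target) is not None]

    def distance(r, s):
        keys = [key for key in r
                if key != target and r[key] is not None and s.get(key) is not None]
        if not keys:
            return float("inf")
        return sum(r[key] != s[key] for key in keys) / len(keys)

    filled = []
    for r in records:
        if r.get(target) is None and complete:
            nearest = sorted(complete, key=lambda s: distance(r, s))[:k]
            value = Counter(s[target] for s in nearest).most_common(1)[0][0]
            r = {**r, target: value}
        filled.append(r)
    return filled


# Made-up sample records; field names and values are hypothetical.
talent = [
    {"org": "a", "field": "ai", "title": "professor", "degree": "phd"},
    {"org": "a", "field": "ai", "title": "professor", "degree": "phd"},
    {"org": "b", "field": "bio", "title": "engineer", "degree": "msc"},
    {"org": "a", "field": "ai", "title": "professor", "degree": None},
]
talent_filled = knn_fill(talent, "degree", k=2)
```

With these sample records, the last entry takes its `degree` from the two records that agree with it on every other field. In practice, pairs flagged by `same_organization` would still need review, since unrelated organizations can share an initialism.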
Keywords/Search Tags:Science and Technology Data, Data Quality, Data Redundancy, Missing Value Processing