Verification Method Of Data Quality In Shaanxi Science And Technology Cloud

Posted on:2018-02-01

Degree:Master

Type:Thesis

Country:China

Candidate:L J Lei

GTID:2348330521951168

Subject:Engineering

Abstract/Summary:

The era of big data presents new challenges for enterprise storage, data management, and data mining. Data quality is a prerequisite for the effective analysis and utilization of big data. Through collection and reporting, the Shaanxi Provincial Department of Science and Technology gathers a large volume of scientific and technological information resources from decentralized nodes that span regions and sectors. However, several factors lead to serious data quality problems such as incomplete, inconsistent, and abnormal data: the node departments share no unified data format, human error causes duplicate entries and incorrect registrations, and scientific and technological information records contain unknown, omitted, or missing values. Based on the "Science and Technology Cloud" project of the Shaanxi Science and Technology Resource Co-ordination Center, this thesis analyzes and summarizes the data quality problems found in the project and examines two major ones: data redundancy caused by abbreviated organization names, and partially missing values in scientific and technological information. First, the thesis introduces the research background and significance of data quality and explains the basic concepts of data quality and the key data preprocessing technologies of the "Science and Technology Cloud" platform. It then builds a set of data quality assessment dimensions for the "Science and Technology Cloud" and uses them to formulate data quality assessment criteria for scientific and technological information data. Next, it studies the verification of simple duplicate data, covering both exactly duplicated records and similar (near-duplicate) records, and on this basis proposes a verification method for the data redundancy caused by abbreviated organization names. In scientific and technological information data, missing values lead to inaccurate data mining results, and missing key attributes reduce the number of usable samples; both affect the results of data analysis. To address missing data, the thesis proposes filling missing values using nearest neighbor interpolation and association rules. These methods are designed, implemented, and applied in the "Science and Technology Cloud" project. Experiments on 15,643 records extracted from the scientific and technical talent pool and the scientific literature library verify the effectiveness and feasibility of the data redundancy verification and missing value filling methods in the "Science and Technology Cloud".
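
The abstract does not give algorithmic details, so the following is only a minimal sketch of the two ideas it names, not the thesis's actual method. It assumes that an abbreviated organization name keeps the first character of the full name and preserves character order, and that "nearest neighbor" means the complete record sharing the most attribute values with the incomplete one. The function names (`is_abbreviation`, `same_organization`, `fill_missing`), the 0.85 similarity threshold, and the record fields are all hypothetical. Association-rule filling is not sketched here.

```python
from difflib import SequenceMatcher


def is_abbreviation(short, full):
    """Heuristic (assumption): `short` starts like `full` and its characters
    appear in the same order inside `full`."""
    short, full = short.lower().replace(" ", ""), full.lower()
    if not short or not full or short[0] != full[0]:
        return False
    remaining = iter(full)
    # `ch in remaining` consumes the iterator, so order is enforced.
    return all(ch in remaining for ch in short)


def same_organization(a, b, threshold=0.85):
    """Treat two names as one organization if they are equal, if one looks
    like an abbreviation of the other, or if they are highly similar."""
    a, b = a.strip(), b.strip()
    if a.lower() == b.lower():
        return True
    short, full = sorted((a, b), key=len)
    if is_abbreviation(short, full):
        return True
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def fill_missing(records, target):
    """Fill a missing `target` value from the complete record that shares
    the most other attribute values (a simple nearest-neighbor rule)."""
    complete = [r for r in records if r.get(target) is not None]
    filled = []
    for r in records:
        if r.get(target) is None and complete:
            others = [k for k in r if k != target and r[k] is not None]
            nearest = max(
                complete,
                key=lambda c: sum(c.get(k) == r[k] for k in others),
            )
            r = {**r, target: nearest[target]}
        filled.append(r)
    return filled
```

With these assumptions, `same_organization("西电", "西安电子科技大学")` treats the short form as a duplicate of the full name. The subsequence rule is deliberately permissive and would need a dictionary of known abbreviations or manual review to control false matches on real data.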

Keywords/Search Tags:

Science and Technology Data, Data Quality, Data Redundancy, Missing Value Processing

Related items

1	Research On Querying Missing Data
2	Data Preparation For Risk Control Of Medical Insurance Fund
3	Research On Passenger Transport Data Quality Detection And Missing Data Imputation
4	Study On Data Dependency-Based Data Quality Processing Techniques In Data Integration
5	Research And Implementation Of Data Redundancy Elimination Technology For Wide Area Network
6	A Study On SVM Algorithm For Missing Data Processing
7	Research On Data Cleaning Based On Science And Technology Innovation Big Data Public Platform
8	Designing Of Noxious Matter Data Processing System Estimating Missing Data In An Animal Barn
9	Study On Data Quality Assessment Techniques For Telecom Data Mining
10	Technology For Answering Queries On Incomplete Data