Internet Open Source Data Availability Evaluation System

Posted on:2018-06-05

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Zhang

Full Text:PDF

GTID:2348330533469816

Subject:Computer science and technology

Abstract/Summary:

In the era of big data, people can quickly and easily access all kinds of data on the Internet through many channels. Because of human input errors, differing versions and sources, deliberate tampering, and other causes, such data often suffers from missing attributes, similar duplicate records, and abnormal attribute values. Erroneous data causes redundancy and wastes storage space; more seriously, it can introduce serious deviations into data mining and lead to wrong decisions [1]. To address these problems, we need to identify these three types of erroneous data, evaluate the availability of the data, and establish an index system for scoring it.

This thesis addresses the problem of data availability evaluation by designing and implementing a data availability evaluation system. Recognition methods for attribute-missing data, similar duplicate data, and numerically abnormal data are designed in order to compute the values of the quantitative evaluation indicators. The thesis also proposes a comprehensive evaluation of data availability from seven aspects (accuracy, consistency, completeness, uniqueness, timeliness, operability, and applicability) and establishes the corresponding evaluation system.

The first part of the thesis identifies and processes erroneous data, covering the recognition of attribute-missing data, similar duplicate data, and abnormal data, and the labeling of the results. The identified results are used to calculate the values of the quantitative indicators in the availability evaluation. Missing column attributes are recognized with an attribute-difference method, and records with missing attributes are recognized by looking for ordering rules in numeric sequences. Similar duplicate data is identified with an improved edit-distance-based field matching algorithm combined with an improved nearest neighbor sorting algorithm. The improved field matching algorithm can handle strings whose parts appear in reverse order, which makes it more general. The improved nearest neighbor sorting algorithm removes the original algorithm's dependence on the sorting key and uses a sliding window, which raises the recognition rate for similar duplicate data.

The second part evaluates the availability of data: it establishes the data availability evaluation system and determines the weight of each index. Availability is evaluated from the seven aspects listed above, and the weight of each index is determined by the expert scoring method and the analytic hierarchy process. Finally, the data availability score is calculated so that different data sets can be scored and their availability evaluated. A data availability scoring system is designed for this purpose, and the scoring results are reasonable and credible...
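
The abstract does not give the details of the improved algorithms, so the following is only a minimal Python sketch of the general ideas it names, not the thesis's method. It assumes that sorting the whitespace-separated tokens of a field is an acceptable stand-in for handling reversed word order, that candidate duplicates are found by comparing each record with its neighbors inside a sliding window over the sorted records, and that the final score is a normalized weighted sum of per-indicator scores. The function names, the window size, the 0.85 threshold, and the sample weights are all hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(field: str) -> str:
    """Lowercase and sort tokens so word order does not matter (an assumption)."""
    return " ".join(sorted(field.lower().split()))


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] derived from edit distance on normalized fields."""
    na, nb = normalize(a), normalize(b)
    longest = max(len(na), len(nb))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(na, nb) / longest


def find_similar_pairs(records: List[str], window: int = 3,
                       threshold: float = 0.85) -> List[Tuple[int, int]]:
    """Sort records, then compare each one with the next window-1 neighbors."""
    order = sorted(range(len(records)), key=lambda i: normalize(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if field_similarity(records[i], records[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


def availability_score(scores: Dict[str, float],
                       weights: Dict[str, float]) -> float:
    """Weighted average of indicator scores; weights are assumed given
    (for example from expert scoring or an analytic hierarchy process)."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total


if __name__ == "__main__":
    names = ["Zhang Wei", "Wei Zhang", "Li Na", "Zhang Wej"]
    print(find_similar_pairs(names))
    print(availability_score({"accuracy": 0.9, "uniqueness": 0.5},
                             {"accuracy": 3.0, "uniqueness": 1.0}))
```

In this sketch the window bounds the number of comparisons per record, at the cost of missing duplicates that sort far apart; a larger window trades speed for recall.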

Keywords/Search Tags:

data availability, evaluation indicators, missing attribute data, approximately duplicate data

Related items

1	Similar Repetitive Record Detection Method In Uncertainty Database
2	An Improved Method For Detecting Incremental Approximately Duplicate Records Based On Clustering Tree
3	Research On Detection Of Approximate Duplicate Records For Massive Data
4	Research On The Method Of Approximately Duplicated Records Detection For Text Data In Big Data Environment
5	Research On Detection And Repair Of Structure Data Availability Violation
6	Research On Strategy Of Repairing Missing Data Based On Active Learning
7	Research On Multi-source Heterogeneous Large Data Cleaning Technology Based On Machine Learning
8	Research On Data Cleaning Method Based On Optimal Feature Selection
9	Research Of Key Technology In Massive Data Cleaning
10	Research On Cleaning Method For XML Similarity Duplicate Data