A Comparative Study On The Methods Of Agricultural Big Data Cleaning

Posted on:2018-04-17

Degree:Master

Type:Thesis

Country:China

Candidate:X L Qian

Full Text:PDF

GTID:2359330518977606

Subject:Management Science and Engineering

Abstract/Summary:

With the rapid development of information technology, data is being produced and accumulated at an alarming rate, and the era of big data has arrived. Regarded as the next important strategic resource after oil, big data plays an important role in fields such as medicine, transportation, and energy. As a basic industry in China, agriculture is shifting from traditional to modern agriculture and generating large volumes of data in the process, and China is building a large agricultural data platform to support the construction of modern agriculture. However, the quality of agricultural big data varies considerably, both because agricultural informatization lags behind that of other industries and because agriculture itself is complex. High-quality data is the key to the value of big data, so it is important to evaluate the quality of agricultural data and then take measures to improve it.

Against this background, this thesis studies cleaning methods for agricultural big data. First, based on a literature review, it surveys the theories of data cleaning and data quality and the current state of data cleaning methods, and then compares the data cleaning tools available on the market. Second, it summarizes the current state and composition of agricultural big data in China and presents its characteristics and the challenges they pose for data quality. Working from several dimensions of data quality, it designs specific steps and methods for evaluating the quality of agricultural big data by means of a data quality survey and the data quality index method, and then analyzes the results in a two-dimensional matrix.

Finally, the thesis focuses on cleaning similar and duplicate records and designs a similar-duplicate-record cleaning experiment using Febrl. The experiment uses agricultural literature data from databases including Web of Science, ScienceDirect, and SpringerLink, and compares the effectiveness of the Edit-Dist, Q-gram, and Smith-Waterman-Dist algorithms for field matching. Based on the characteristics of the original data, an improved version of the SNM algorithm is selected as the method for cleaning similar duplicate records, in order to improve cleaning efficiency. The results are evaluated with three indexes: recall, accuracy, and F-value. They show that the Smith-Waterman-Dist algorithm performs better than the other two, and that the improved SNM algorithm is superior to the traditional one in the accuracy and comprehensiveness of the records it detects, which meets the needs of duplicate-record cleaning for literature data.
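The duplicate-detection pipeline described in the abstract can be illustrated with a minimal sketch. It assumes SNM refers to the sorted neighborhood method, uses a Dice-style q-gram similarity for field matching, and assumes that "accuracy" in the evaluation corresponds to precision. It implements only the basic sliding-window SNM, not the improved variant (which the abstract does not specify), and it does not use the Febrl API. All function names, the window size, the threshold, and the sample titles are hypothetical.

```python
from collections import Counter


def qgrams(text, q=2):
    """Multiset of q-grams of a lowercased, '#'-padded string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram multisets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 1.0  # two empty strings (only possible when q == 1)
    shared = sum((ga & gb).values())  # multiset intersection
    return 2 * shared / total


def sorted_neighborhood(records, key=str.lower, window=2, threshold=0.8):
    """Basic SNM: sort by a key, then compare each record only with
    the next (window - 1) records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            if qgram_similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def precision_recall_f(found, truth):
    """Precision, recall and F-value of detected duplicate pairs."""
    tp = len(found & truth)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_value


# Hypothetical literature titles; records 0 and 2 differ by one typo.
titles = [
    "Soil moisture sensing in precision agriculture",
    "Crop yield prediction with remote sensing",
    "Soil moisture sensing in precison agriculture",
    "Big data platforms for modern agriculture",
]
found_pairs = sorted_neighborhood(titles)
scores = precision_recall_f(found_pairs, {(0, 2)})
```

Sorting first means each record is compared with only a few neighbors instead of with every other record, which is where the efficiency gain of SNM comes from. The trade-off is that duplicates whose sort keys differ near the start of the string can fall outside the window and be missed.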

Keywords/Search Tags:

agricultural big data, data quality, data cleaning, duplicate data

Related items

1	Loan Model Based On Low Quality And Small Sample Data
2	An Approach To Enterprise Data Rationalization Enalber
3	A Case Study On The Problem And Countermeasure Of Sunk Data
4	Data Quality Management Research On Enterprise Implement Of ERP System
5	Research On The Management Of Tax Data Quality After Tax Data Concentrated
6	Gansu Branch Of Agricultural Bank Of China Customer Data Of Marketing Research Under The Background Of Data Concentration
7	Research On ERP System Data Quality Optimization Management From The Perspective Of Informationization
8	Research On Data Asset Management And Utilization Of External Data
9	The Research On P2P Loan Default Risk Identification Model Based On Data Mining Technology
10	Research On Analysis And Application Of After-sales Quality Data Of KT Company Products