A Comparative Study On The Methods Of Agricultural Big Data Cleaning

Posted on: 2018-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: X L Qian
Full Text: PDF
GTID: 2359330518977606
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of information technology, data is being produced and accumulated at an alarming rate, and the era of big data has arrived. As an important strategic resource following oil, big data plays an important role in fields such as medicine, transportation, and energy. Agriculture, a basic industry in China, is in transition from traditional to modern agriculture and generates a large amount of data in the process, and China is working to build a large agricultural data platform to support the construction of modern agriculture. However, the quality of agricultural big data varies considerably, both because agricultural informatization lags behind that of other industries and because agriculture itself is complex. High-quality data is the key to the value of big data, so it is important to evaluate the quality of agricultural data and then take measures to improve it.

Against this background, this thesis studies cleaning methods for agricultural big data. First, based on a literature review, it surveys the theories of data cleaning and data quality and the current state of data cleaning methods, and compares the data cleaning tools available on the market. Second, it summarizes the current state and composition of agricultural big data in China and presents its characteristics and the challenges it poses for data quality. Working from several dimensions of data quality, it designs specific steps and methods for evaluating the quality of agricultural big data, using a data quality survey and a data quality index method, and analyzes the results in a two-dimensional matrix.

Finally, the thesis focuses on cleaning similar and duplicate records and designs a similar-duplicate cleaning experiment using Febrl. The experiment uses agricultural literature data from databases including Web of Science, ScienceDirect, and SpringerLink, and compares the effectiveness of the Edit-Dist, Q-gram, and Smith-Waterman-Dist algorithms for field matching. In view of the characteristics of the original data, an improved version of the SNM algorithm is selected as the similar-duplicate detection method in order to improve cleaning efficiency. The results are evaluated with three indexes: recall, precision, and F-value. They show that Smith-Waterman-Dist outperforms the other two field-matching algorithms and that the improved SNM algorithm is superior to the traditional one in both precision and recall, which meets the needs of duplicate-record cleaning for literature data.
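To make the duplicate-detection pipeline described in the abstract more concrete, the following is a minimal Python sketch, not the thesis's implementation and not Febrl code. It assumes that SNM refers to the sorted neighbourhood method in its basic fixed-window form (the thesis's improved variant is not specified in the abstract), uses a Dice coefficient over q-grams as the field comparator, and scores the detected pairs with precision, recall, and F-value. The sample titles, the function names, the window size, and the 0.8 threshold are all hypothetical.

```python
def qgrams(text, q=2):
    """Return the set of q-grams of a lowercased string, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram sets, a value in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def sorted_neighbourhood(titles, window=3, threshold=0.8):
    """Basic SNM: sort by a key, then compare each record only with the
    next (window - 1) records in sorted order. Returns index pairs."""
    order = sorted(range(len(titles)), key=lambda i: titles[i].lower())
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            if qgram_similarity(titles[i], titles[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def evaluate(found, truth):
    """Precision, recall, and F-value of detected pairs against known pairs."""
    tp = len(found & truth)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Hypothetical literature titles; records 0/1 and 2/3 are near-duplicates.
titles = [
    "Big data cleaning methods in agriculture",
    "Big data cleaning method in agriculture",
    "Soil moisture sensing with wireless networks",
    "Soil moisture sensing with wireless network",
    "Crop yield prediction using remote sensing",
]
truth = {(0, 1), (2, 3)}

found = sorted_neighbourhood(titles)
print(sorted(found), evaluate(found, truth))
```

The window keeps the number of comparisons roughly linear in the number of records instead of quadratic, which is the efficiency argument for SNM; the cost is that duplicates whose sort keys differ early in the string can fall outside the window and be missed. Swapping `qgram_similarity` for an edit-distance or Smith-Waterman comparator would be the way to reproduce the kind of comparison the abstract reports.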
Keywords/Search Tags: agricultural big data, data quality, data cleaning, duplicate data