Research On Large-scale Data Storage Model And Duplicate Data Detection Method In Food Traceability System

Posted on: 2016-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: F S Dong
Full Text: PDF
GTID: 2308330464969108
Subject: Software engineering
Abstract/Summary:
A traceability system is widely regarded as an effective means of ensuring product quality. In recent years, food quality and safety problems have become increasingly serious, and food traceability systems have been widely adopted as a method of supervision. Such a system can monitor every link in the food chain, from raw-material sourcing through processing to sale in the market, and thereby reduce food quality and safety risks. However, because food is produced in high volume and the system covers every link of production, the tracing data become very large and complex and are difficult to handle. How to process the tracing data is therefore the key problem in a food traceability system.

Against this background, this thesis studies the processing of massive tracing data. It focuses on two aspects: the design of a data storage model and duplicate data detection in a large-scale data environment.

For the storage model, a design that uses both a relational database and a NoSQL database to serve different requests is proposed, based on the characteristics of tracing data in the system. Queries on tracing data are served mainly by the NoSQL database, while the relational database stores enterprise data. Because enterprise data are also needed when querying food tracing information, they must be stored in the NoSQL database as well. The NoSQL database thus acts as a mirror of the relational database, and changes made in the relational database must be synchronized to the NoSQL database. A data synchronization strategy is therefore studied. A synchronization method based on a data middleware cache is proposed: the middleware maintains a cache module that stores changed data, and the changed data are synchronized to the mirror database when a set condition is met.
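
As a minimal sketch of the middleware cache described above, the following Python example buffers changes written to the relational side and pushes them to the mirror once a threshold is reached. The class name `SyncMiddleware`, the in-memory dictionaries standing in for the two databases, and the count-based `flush_threshold` are all assumptions made for illustration; the abstract does not specify the actual set condition or data structures. The `read` method overlays pending changes on the mirror's contents, so a query sees a change before it has been synchronized.

```python
from typing import Dict, Optional


class SyncMiddleware:
    """Illustrative middleware: caches changed rows and mirrors them in batches."""

    def __init__(self, flush_threshold: int = 3) -> None:
        # Plain dicts stand in for the relational database and its NoSQL mirror.
        self.primary: Dict[str, dict] = {}
        self.mirror: Dict[str, dict] = {}
        # Changed rows not yet mirrored; a value of None marks a deletion.
        self.pending: Dict[str, Optional[dict]] = {}
        # Hypothetical "set condition": flush once this many keys have changed.
        self.flush_threshold = flush_threshold

    def write(self, key: str, row: dict) -> None:
        """Apply a change to the primary store and record it in the cache."""
        self.primary[key] = row
        self.pending[key] = row
        self._maybe_flush()

    def delete(self, key: str) -> None:
        """Delete from the primary store and record the deletion in the cache."""
        self.primary.pop(key, None)
        self.pending[key] = None
        self._maybe_flush()

    def _maybe_flush(self) -> None:
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Push all cached changes to the mirror, then empty the cache."""
        for key, row in self.pending.items():
            if row is None:
                self.mirror.pop(key, None)
            else:
                self.mirror[key] = row
        self.pending.clear()

    def read(self, key: str) -> Optional[dict]:
        """Query the mirror, corrected by any change still waiting in the cache."""
        if key in self.pending:
            return self.pending[key]
        return self.mirror.get(key)
```

With `flush_threshold=3`, two writes leave the mirror untouched while `read` already returns the new rows; a change to a third key triggers the batch synchronization. A real deployment would more likely flush on a timer or a size limit, which this sketch does not model.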
In addition, the middleware uses the cached data to correct query results returned by the mirror database. Compared with traditional methods, the new method not only achieves high efficiency but also provides real-time query results.

For duplicate data detection, a new method is proposed based on traditional data cleaning methods. The data in the system are divided into two parts, stored data and incremental data, which are checked in different ways. For stored data, the sorted-neighborhood method is used and is improved with MapReduce so that data can be checked in parallel. For incremental data, the sliding-window comparison of the sorted-neighborhood method is replaced by jumping windows, which are better suited to incremental data; MapReduce is used to improve this method as well. In this way, detection is accelerated and many invalid comparisons are avoided. Overall, the improved method greatly increases detection speed while maintaining high detection accuracy.

Through this research, a data storage model that uses both NoSQL and relational databases is designed, and data middleware is used to solve the data synchronization problem in the model. The storage model can serve as a demonstration for the traceability field. In addition, the duplicate data detection method improves data quality, which is important for ensuring accurate query results and for building a data warehouse for data mining.
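
The classic sorted-neighborhood method that the abstract takes as its starting point can be sketched as follows. This is a single-machine illustration only: the MapReduce parallelization and the jumping-window variant for incremental data are not shown, because the abstract does not describe them in enough detail to reproduce. The function names, the case-insensitive string similarity based on `difflib.SequenceMatcher`, and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Assumed matcher: case-insensitive similarity ratio at or above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def sorted_neighborhood(
    records: Sequence[str],
    key: Callable[[str], str],
    window: int = 3,
    match: Callable[[str, str], bool] = similar,
) -> List[Tuple[int, int]]:
    """Return index pairs of likely duplicates found inside a sliding window."""
    if window < 2:
        raise ValueError("window must be at least 2")
    # Step 1: sort record indices by the sorting key so that likely
    # duplicates end up close to each other.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs: List[Tuple[int, int]] = []
    # Step 2: slide a window over the sorted order and compare each record
    # only with the (window - 1) records that precede it.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if match(records[i], records[j]):
                pairs.append((min(i, j), max(i, j)))
    return pairs
```

Because each record is compared with at most `window - 1` neighbors, the number of comparisons grows linearly with the number of records instead of quadratically. The cost is that duplicates whose keys sort far apart are missed, so the choice of sorting key matters as much as the matcher.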
Keywords/Search Tags: food traceability system, large-scale data storage model, heterogeneous data synchronization, duplicate data detection