
Research On Key Technologies Of Big Data Currency

Posted on: 2017-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Gao
GTID: 2308330503487179
Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of the big data era, massive data is widely used in enterprises and in people's daily lives. Data quality plays an important role in data applications, and data currency is one of the main factors affecting data quality. In real applications, timestamps are often incomplete or missing altogether. Existing research on data currency mainly covers three directions: using temporal orders, currency constraints and reference data to infer the most current attribute values of each entity; combining currency rules with conditional functional dependencies to improve data currency and consistency; and combining currency rules with statistical techniques to improve data currency. Based on the characteristics of big data (volume, velocity and variety), we propose key technologies for big data currency. The work in this thesis mainly includes the following aspects.

For the volume of big data, we use the distributed processing framework MapReduce to handle massive data, and we reduce the big data currency problem to the K-Partition problem (NP-complete). We then extend the approximation algorithm for the 2-Partition problem and propose a distributed approximation algorithm that optimizes load balance in the reduce phase of MapReduce, with an approximation ratio close to 1. Experiments verify the efficiency and accuracy of the load-balanced MapReduce approach.

For the velocity of big data, we propose a model of dynamic data currency. First, the original dataset is preprocessed and the records pertaining to the same entity are sorted using currency rules. Then, incoming updates are processed dynamically and in real time. Meanwhile, we improve the efficiency of the algorithm in the following ways: creating an entity query B-tree to speed up finding the corresponding entity; introducing a static linked list for entity storage to reduce the response time of dataset updates; and, based on the constraints imposed by the rules, building a topological graph of the attribute processing order and an inverted index from attribute values to tuple IDs to optimize the processing of currency rules.

For the variety of big data, we consider data currency together with consistency, accuracy, integrity and identity, using currency rules, conditional functional dependencies, matching rules and master data to handle multi-source mixed data and improve data availability. Meanwhile, we improve integrity by filling missing attribute values with the nearest current value. Experiments show that the algorithm fills missing values with high accuracy.
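
As a rough illustration of the load-balancing idea in the volume part of the abstract, the sketch below assigns entity groups to reducers so that the reducer loads stay close to equal. It is not the thesis's algorithm: the abstract only says that an approximation algorithm for 2-Partition is extended, so this example swaps in the classic greedy longest-processing-time heuristic for K-way partitioning. The function name `assign_groups_to_reducers`, the `group_sizes` mapping and the entity labels are all hypothetical.

```python
import heapq


def assign_groups_to_reducers(group_sizes, k):
    """Greedy K-way partition (longest-processing-time heuristic).

    group_sizes: dict mapping a (hypothetical) entity-group key to its record count.
    k: number of reducers.
    Returns (assignment, loads), where assignment maps each group key to a
    reducer index and loads[r] is the total record count sent to reducer r.
    """
    # Min-heap of (current_load, reducer_index); the root is the least-loaded reducer.
    heap = [(0, r) for r in range(k)]
    heapq.heapify(heap)

    assignment = {}
    # Place the largest groups first; ties are broken by key for determinism.
    for group, size in sorted(group_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[group] = reducer
        heapq.heappush(heap, (load + size, reducer))

    # Recompute the final load of each reducer from the assignment.
    loads = [0] * k
    for group, reducer in assignment.items():
        loads[reducer] += group_sizes[group]
    return assignment, loads


# Illustrative input only: record counts per entity group are made up.
example_sizes = {"e1": 7, "e2": 5, "e3": 4, "e4": 3, "e5": 1}
example_assignment, example_loads = assign_groups_to_reducers(example_sizes, 2)
```

In a MapReduce setting, a mapping like `example_assignment` would typically drive a custom partitioner, so that records of the same entity still reach the same reducer while the totals per reducer remain balanced.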
Keywords/Search Tags:big data, data quality, data currency, volume, velocity, variety