Method For Calculating Approximate Results Based On Resampling

Posted on: 2017-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: J F Li
GTID: 2348330509957111
Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet technology, the volume of data produced across many fields has grown explosively. Big data analysis, however, tends to consume very large amounts of computing resources and time. In many cases, users prefer approximate results that are accurate enough and available quickly over exact results that are slow to compute. For approximate results in big data analysis, sampling is almost the only approach that reduces both the computing resources and the running time. Yet on big data, simple random sampling tends to require such a large number of samples that sampling loses its benefit, and there is hardly any other sampling method for big data analysis that fully supports mainstream distributed computing architectures (e.g., MapReduce). In addition, even for the same query on the same data set, different users may have different accuracy requirements for the approximate result. How to provide different users with approximate results of different accuracy has therefore also become a problem to be solved.
We propose and implement an accuracy-controllable method for producing approximate results in big data analysis, based on sampling and the MapReduce computing architecture, to solve the problem of providing different users with approximate results of different accuracy. We control the precision of the approximate results by changing the sampling frequency on the big data sets. We also modify the kernel code of the Hadoop system so that the sampling method runs efficiently in a distributed cluster. The relationship between the accuracy of the approximate result and the resampling frequency is analyzed in detail.
Finally, through a series of experiments on a variety of data sets, we verify that the system not only reduces the running time and the computing resources of a job, but also provides approximate results that are accurate enough. The experiments demonstrate the validity and usability of the system's results, and the advantages of a computation scheme that provides accuracy-controllable approximate results.
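As an illustration of the idea of controlling accuracy through the sampling frequency, the following is a minimal single-process sketch in Python. It is not the thesis's Hadoop implementation: the function names (`map_sample`, `reduce_mean`, `approximate_mean`, `mean_relative_error`), the choice of per-record Bernoulli sampling, the mean as the query, and the synthetic exponential data are all assumptions made for this example.

```python
import random
import statistics

def map_sample(partition, rate, rng):
    """Map phase (hypothetical): keep each record with probability `rate`
    and emit a partial (sum, count) for the kept records."""
    kept = [x for x in partition if rng.random() < rate]
    return sum(kept), len(kept)

def reduce_mean(partials):
    """Reduce phase (hypothetical): combine partial (sum, count) pairs into a mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else float("nan")

def approximate_mean(partitions, rate, seed=0):
    """Estimate the mean of all records from a sample taken at `rate`."""
    rng = random.Random(seed)
    return reduce_mean([map_sample(p, rate, rng) for p in partitions])

def mean_relative_error(partitions, rate, trials=30):
    """Average relative error of the estimate over several independent samples."""
    exact = reduce_mean([(sum(p), len(p)) for p in partitions])
    errors = [
        abs(approximate_mean(partitions, rate, seed=t) - exact) / abs(exact)
        for t in range(trials)
    ]
    return statistics.mean(errors)

# Synthetic data standing in for a partitioned data set (an assumption).
data_rng = random.Random(42)
partitions = [[data_rng.expovariate(1.0) for _ in range(5000)] for _ in range(4)]

# A higher sampling rate costs more work but gives a smaller expected error.
for rate in (0.01, 0.1, 0.5):
    print(rate, mean_relative_error(partitions, rate))
```

In this sketch a user with a loose accuracy requirement would choose a small `rate` and process few records, while a user with a strict requirement would choose a larger one. Setting `rate` to 1.0 keeps every record and reproduces the exact mean.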
Keywords/Search Tags: big data, resampling, accuracy controllable computing, approximate result