Research On Online Aggregation Query Processing Based On Hadoop

Posted on: 2016-11-08
Degree: Master
Type: Thesis
Country: China
Candidate: J H Hu
Full Text: PDF
GTID: 2348330542975726
Subject: Software engineering
Abstract/Summary:
Since the start of the information era, the amount of information, and with it the amount of data, has grown explosively. Extracting useful information from massive data sets has become increasingly urgent. Data has long been stored and processed in relational database management systems (RDBMS), where the aggregation query is an important operation in statistical analysis. As the volume of data to be processed grows rapidly, users must wait a long time for an exact aggregation result, because traditional relational databases execute aggregation queries in batch mode. Online aggregation instead returns a continuously refined approximate result during query processing, computed from the data sampled so far; once all data has been processed, the final result is obtained. When the precision of the result reaches the level the user requires, the user can stop the query, saving both time and system resources.

With the development of Hadoop, processing large volumes of data has become more efficient. However, data is effectively "limitless," while computational and storage resources are limited. This problem is currently hard to solve in general, but specific solutions can still be proposed for specific applications. We propose a Hadoop-based approximate aggregation query processing method with iterative sampling, which combines Hadoop's strength in processing large volumes of data with the online aggregation query model. The precision the user requires of the approximate result is met in two sampling iterations. From the user's desired precision and the data drawn in the first iteration, we compute the sample size needed to meet that precision. The data from both iterations is then used to return the approximate aggregation result to the user. To avoid the effects of data bias, we propose a "layered sampling" (that is, stratified sampling) method to ensure that the approximate aggregation result is statistically meaningful.

Finally, in the experiments, we compare the effects of various sampling methods on the aggregation query result. The results show that our "layered sampling" method accounts for time efficiency, letting the user trade off processing time against the precision of the result, and also for the usage of the computational and storage resources of the Hadoop cluster. We also compare the efficiency of our iterative aggregation query method with that of the most recent Hadoop-based online aggregation method; the experimental results indicate that our method is more efficient.
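The two-iteration idea described above can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: it assumes simple random sampling without replacement, an AVG aggregate, a normal-approximation confidence interval, and no finite-population correction. The function names `required_sample_size` and `two_phase_avg`, and the parameters `pilot_size`, `rel_error`, and `z`, are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def required_sample_size(pilot, rel_error, z=1.96):
    """Estimate how many rows are needed so that the confidence-interval
    half-width z * s / sqrt(n) is at most rel_error * |mean|.

    Assumption: the pilot sample's mean and standard deviation are good
    enough stand-ins for the population values.
    """
    mean = statistics.fmean(pilot)
    std = statistics.stdev(pilot)  # needs at least two pilot values
    if mean == 0:
        raise ValueError("relative error is undefined for a zero mean")
    n = (z * std / (rel_error * abs(mean))) ** 2
    # Never ask for fewer rows than were already drawn in the pilot.
    return max(len(pilot), math.ceil(n))


def two_phase_avg(data, pilot_size, rel_error, z=1.96, seed=0):
    """Approximate AVG(data) in two sampling iterations.

    Iteration 1 draws a pilot sample; iteration 2 tops it up to the
    estimated required size. Both samples are combined for the answer.
    """
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)  # random order gives sampling without replacement

    pilot = [data[i] for i in order[:pilot_size]]
    n = min(required_sample_size(pilot, rel_error, z), len(data))

    # The first n shuffled positions contain the pilot plus the extra rows.
    sample = [data[i] for i in order[:n]]
    estimate = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return estimate, half_width, len(sample)


if __name__ == "__main__":
    gen = random.Random(1)
    rows = [gen.uniform(50, 150) for _ in range(100_000)]
    est, hw, used = two_phase_avg(rows, pilot_size=100, rel_error=0.02)
    print(f"AVG ~ {est:.2f} +/- {hw:.2f} using {used} of {len(rows)} rows")
```

In a Hadoop setting the two iterations would presumably run as separate sampling jobs over input splits, and a stratified variant would apply the same size estimate within each stratum; both are outside the scope of this sketch.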
Keywords/Search Tags: Online Aggregation, MapReduce, iterative sampling, sample size estimation