An Online Big Data Analytic System Leveraging Uncertain Query Processing

Posted on: 2018-02-16, Degree: Master, Type: Thesis
Country: China, Candidate: M J Jin, Full Text: PDF
GTID: 2348330512483400, Subject: Computer Science and Technology
Abstract/Summary:
As big data analysis technology matures, large datasets are attracting growing attention for their value. Because of the sheer volume of data, big data analysis is usually very time consuming. In many cases, however, the user does not need exact query results: a summary of the data meets the needs of most analyses. This thesis designs and implements a big data analysis and processing system based on approximate (uncertain) queries. The system defines a query interface that lets users run a range of aggregate queries (Group By) and returns approximate results. It can process hundreds of gigabytes of data within seconds.

Online aggregation can produce a summary of the data quickly, so we adopt it in our system. In addition, the results of successive queries in our system usually overlap. If the samples and intermediate results produced by earlier queries are stored, later queries can be answered faster.

First, we randomize the dataset once, so that a sequential scan of the randomized copy has the same effect as retrieving records from the original dataset at random. Then we use online aggregation to process the user's query. The online aggregation step generates the query results and stores the acquired samples and intermediate results in a sample management tree. Accordingly, each user query is first evaluated against the sample management tree. When the result obtained from the tree does not meet the accuracy set by the user, we continue to read data from the data source. In this way, samples and intermediate results are reused effectively across multiple queries. We use several statistical methods to combine multiple intermediate results into the final query result. Finally, experimental results on the TPC-H benchmark demonstrate the effectiveness of the approach.
Keywords/Search Tags: Online Aggregation, Sample, Confidence Interval, Tree
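
The pipeline described in the abstract (randomize once, scan sequentially, stop when the confidence interval is narrow enough, reuse earlier work) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The names `RunningStats` and `OnlineGroupAvg` are hypothetical, the aggregate is limited to AVG with Group By, the interval is a plain normal approximation at 95% confidence, and a flat per-group cache stands in for the sample management tree, whose structure the abstract does not describe.

```python
import math
import random
from collections import defaultdict

Z_95 = 1.96  # normal quantile for a two-sided 95% confidence interval (assumption)


class RunningStats:
    """Welford's online mean and variance for one group."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self, z=Z_95):
        """Half-width of the normal-approximation interval for the mean."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)
        return z * math.sqrt(variance / self.n)


class OnlineGroupAvg:
    """Approximate AVG(value) GROUP BY key over (key, value) rows.

    Hypothetical stand-in for the system in the abstract: the cached
    per-group statistics play the role of stored intermediate results.
    """

    def __init__(self, rows, seed=0):
        # One-time randomization: a sequential scan of the shuffled copy
        # is then equivalent to sampling rows without replacement.
        self.shuffled = list(rows)
        random.Random(seed).shuffle(self.shuffled)
        self.stats = defaultdict(RunningStats)
        self.pos = 0  # number of rows consumed so far

    def _accurate_enough(self, target, min_samples):
        # Only groups seen so far are checked; rare groups may be missed.
        return bool(self.stats) and all(
            s.n >= min_samples and s.half_width() <= target
            for s in self.stats.values()
        )

    def query(self, target_half_width, min_samples=30, check_every=100):
        """Return {key: (estimate, half_width, samples_used)}.

        Cached statistics are consulted first; more rows are read only
        when they do not yet meet the requested accuracy.
        """
        while (self.pos < len(self.shuffled)
               and not self._accurate_enough(target_half_width, min_samples)):
            batch = self.shuffled[self.pos:self.pos + check_every]
            for key, value in batch:
                self.stats[key].add(value)
            self.pos += len(batch)
        return {k: (s.mean, s.half_width(), s.n)
                for k, s in self.stats.items()}
```

Calling `query` a second time with a looser target returns immediately from the cached statistics, while a tighter target resumes the scan from `pos` instead of starting over. The sketch ignores the finite-population correction and the effect of repeatedly checking the stopping rule, both of which a production system would need to account for.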