Research And Implementation Of Sampling-Based Aggregation Query System On Big Data

Posted on: 2016-02-21 | Degree: Master | Type: Thesis
Country: China | Candidate: S S Zhang | Full Text: PDF
GTID: 2428330542957395 | Subject: Computer software and theory
Abstract/Summary:
In recent years, big data research has become a focus of attention in both academia and industry. Users hope that analyzing big data will uncover hidden relationships in the data and yield deeper, more intelligent information of real reference value. Because the data is both large and sparse, traditional exact query systems cannot meet users' efficiency requirements. At the same time, the queries users pose while analyzing big data can be regarded as exploratory queries without a clearly defined goal, and users of such queries do not demand strictly accurate results. The fundamental purpose of the query system is therefore the analysis and mining of big sparse data. To support keen discovery and fast exploration in this ocean of data, the work draws on research into sampling algorithms and proceeds through theoretical research, algorithm design, system implementation and experimental validation to complete the query system.

The thesis focuses on a sampling-based aggregation query optimization algorithm for big sparse data. The system provides a personalized sample update service based on the user's historical query behavior. The aim is to return approximate query results within a given error constraint, weighing the accuracy of the query result against its error rate. Using real and reliable data sets, the work covers the following areas.

First, the theoretical basis of the system is established. From the data perspective, the original data is analyzed theoretically; from the query perspective, the patterns of users' aggregate queries are analyzed. Before a stratified sample is built, the query patterns must be classified and an assumption made about the similarity between historical and future queries, so as to avoid over-fitting. The system ultimately adopts the Predictable Query Column Sets model to guide sample construction, which improves query efficiency.

Second, the overall architecture of the system is divided into two parts: offline computation and online computation. The offline part builds a sample pool through the system's Sample Creation Module. The sampling scheme combines a simple random sample (SRS) with stratified samples, which requires solving two problems: choosing reasonable query column sets (QCSs) for the stratified samples, and determining the number of tuples in each group of a stratified sample. The online part handles users' queries as they arrive in real time through the system's Sample Selection Module. Based on the error rate and confidence level given with the query, it selects a sample from the sample pool and determines the amount of secondary partitioning of the selected sample. The combinations of attributes involved in each query are then counted and analyzed; these statistics directly affect the distribution of query column sets when the sample pool is updated. The system aims to provide personalized service in all aspects, including personalized sample updates and personalized query interaction.

Finally, because the system is intended to process data in a distributed manner, it is deployed on MySQL Cluster. Using movie ratings data as raw data, the system responds efficiently to query requests by creating and selecting samples, and the query results conform to the user's error constraints. The system's "personalized" services were tested with simulated sets of query requests. The results show that as the attributes users care about change, the collection of samples changes accordingly. Overall, the system is better suited to analysis and information mining on big sparse data than an exact query system.
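
The offline and online steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the per-group cap, the row weights, the normal-approximation sample-size formula, and all function and column names (`build_stratified_sample`, `estimate_count`, `required_sample_size`, `genre`, `rating`) are assumptions introduced here for illustration only.

```python
import math
import random
from collections import defaultdict


def build_stratified_sample(rows, qcs, cap, seed=0):
    """Offline step (assumed scheme): group rows by the columns in `qcs`
    and keep at most `cap` rows per group.

    Each kept row carries a weight equal to group size / kept rows, so that
    weighted counts over the sample estimate counts over the full data.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in qcs)].append(row)

    sample = []
    for members in groups.values():
        # Small (sparse) groups are kept whole; large groups are subsampled.
        kept = members if len(members) <= cap else rng.sample(members, cap)
        weight = len(members) / len(kept)
        sample.extend((row, weight) for row in kept)
    return sample


def estimate_count(sample, predicate):
    """Approximate COUNT(*) WHERE predicate, using the row weights."""
    return sum(weight for row, weight in sample if predicate(row))


def required_sample_size(std_dev, max_error, z=1.96):
    """Online step (assumed formula): rows needed so that the confidence
    interval half-width for a mean is at most `max_error`, using the
    normal approximation; z=1.96 corresponds to 95% confidence."""
    return math.ceil((z * std_dev / max_error) ** 2)


if __name__ == "__main__":
    # Hypothetical movie-ratings rows: one frequent group, one rare group.
    rows = [{"genre": "drama", "rating": 4} for _ in range(1000)]
    rows += [{"genre": "noir", "rating": 5} for _ in range(5)]
    sample = build_stratified_sample(rows, qcs=["genre"], cap=50)
    print(len(sample))
    print(estimate_count(sample, lambda r: r["genre"] == "noir"))
    print(required_sample_size(std_dev=1.0, max_error=0.1))
```

Capping each group rather than sampling uniformly keeps rare attribute combinations fully represented, which is the usual motivation for stratifying on a query column set when the data is sparse.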
Keywords/Search Tags: Sparse Data, Query Pattern, Stratified Sampling, Approximate Query