Research On Join Query Processing And Optimization Techniques In Cloud Computing Environment

Posted on:2015-10-14

Degree:Master

Type:Thesis

Country:China

Candidate:L Huang

Full Text:PDF

GTID:2298330431485917

Subject:Computer application technology

Abstract/Summary:

The growth of cloud computing and Internet of Things technology produces large amounts of data. Joining and querying these data supports tasks such as commercial forecasting, application module design, and user behavior analysis. MapReduce, the processing architecture of cloud computing, is well suited to large-scale data processing. However, when two-table and multi-table join tasks run on MapReduce, a large number of tuples that do not satisfy the join condition are sent from the Map side to the Reduce side, which adds considerable shuffle-stage time and I/O overhead. Multi-table joins are especially costly because they require multiple MapReduce jobs, so their efficiency is very low. Given these disadvantages of join tasks on the MapReduce model, optimizing them has become an urgent problem.

First, to address the shortcomings of MapReduce-based join processing, this thesis proposes an optimization strategy for two-table joins: a mutual filtering policy based on an extended Bloom Filter. The join attribute values of each of the two tables are extracted and compressed with the extended Bloom Filter to form compressed files. These files are then used to filter out the tuples of both tables that do not satisfy the join condition. The method extends the Bloom Filter and reduces the false positive rate, shortens the shuffle phase, and improves the execution efficiency of the system.

Second, for multi-table join tasks, this thesis proposes an improved partition strategy that allows a key/value pair to be sent to multiple Reduce nodes, so that each Reduce node holds some data from every table and can perform the join. Before partitioning, the extended Bloom Filter is used to compress the join keys into compressed files, which filter the multiple tables and improve the efficiency of parallel multi-table joins.

Finally, this thesis proposes a sampling and grouping join strategy. When multiple tables are joined on multiple join attributes, the tables are first sampled and sorted by filtering capacity. Join keys are then extracted from the tables with the stronger filtering ability to form a compressed file for each attribute. To account for the processing capacity of the cluster, when the tables to be joined exceed that capacity they are divided into groups, each group is joined, and the groups' join results are then joined together. This method greatly improves the efficiency of multi-table joins and the overall performance of the system.

The proposed optimizations are verified through a large number of experiments. Analysis of the experimental results shows that the MapReduce-based optimization strategy substantially reduces shuffle-stage cost, improves the efficiency of join tasks, and improves overall system performance.
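
The mutual filtering idea for a two-table join can be illustrated with a small single-process sketch. It is not the thesis's implementation: the abstract does not define the "extended" Bloom Filter, so a plain Bloom filter stands in for it, and every name here (`BloomFilter`, `build_filter`, `mutual_filter`, `hash_join`, the sample tables) is hypothetical. In a MapReduce setting the filtering would happen on the Map side, so that discarded tuples never reach the shuffle stage.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter (assumption: stands in for the thesis's extended variant)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def build_filter(rows, key_index):
    """Compress one table's join-key column into a Bloom filter."""
    bf = BloomFilter()
    for row in rows:
        bf.add(row[key_index])
    return bf


def mutual_filter(left, right, left_key=0, right_key=0):
    """Filter each table with the other table's Bloom filter."""
    left_bf = build_filter(left, left_key)
    right_bf = build_filter(right, right_key)
    kept_left = [row for row in left if row[left_key] in right_bf]
    kept_right = [row for row in right if row[right_key] in left_bf]
    return kept_left, kept_right


def hash_join(left, right, left_key=0, right_key=0):
    """Exact equi-join; removes any false positives the filters let through."""
    index = {}
    for row in left:
        index.setdefault(row[left_key], []).append(row)
    return [l + r for r in right for l in index.get(r[right_key], [])]


# Hypothetical sample tables: (id, name) and (user_id, item).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(2, "book"), (3, "pen"), (4, "ink")]

kept_users, kept_orders = mutual_filter(users, orders)
joined = hash_join(kept_users, kept_orders)
```

Because a Bloom filter never yields false negatives, filtering both tables before the join does not change the join result; it only reduces the number of tuples that would have to be shuffled. The filter size and hash count above are arbitrary and would need tuning against the expected number of distinct join keys.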

Keywords/Search Tags:

MapReduce, Bloom Filter, join query and optimization, Partitioning optimization

Related items

1	Research And Implementation Of The Aggregate-Join Query Optimization Approach Based On Mapreduce
2	Research On Query Analysis And Optimization Based On Spark System
3	The Optimization And Application Of Big Data Query Based On Bloom Filter
4	Research On Equi-Join Optimization Algorithms On Spark SQL
5	Design And Optimization Join Algorithms Based On Map Reduce
6	Top-k Join Query Processing Method Based On MapReduce
7	Research And Application Of SQL Join Optimization Based On Spark
8	Research And Implementation Of The Big Spatial Data Join Query Processing Algorithms In Cloud Environment
9	Design And Optimize Big-Data Join Algorithms Using MapReduce
10	Join Query Processing Over Delay Tolerant Network