Join Query Optimization For Large-Scale Data Based On New Computing Architecture

Posted on:2017-04-19

Degree:Master

Type:Thesis

Country:China

Candidate:H J Shang

Full Text:PDF

GTID:2348330509455404

Subject:Computer technology

Abstract/Summary:

Improving the performance of large-scale data processing is one of the core challenges in big data services. The join is a primary operator in relational databases and in large-scale data analysis, and the way it is processed strongly affects analysis performance. MapReduce is a classical distributed computing architecture for large-scale data, valued for its high scalability and high availability. Optimizing join queries over large-scale data under the MapReduce architecture therefore has clear value for both research and applications.

Distributed computing is the fundamental idea behind big data processing, but for large-scale data with differing characteristics and for new hardware architectures, join queries under MapReduce still have substantial room for performance improvement. First, when the data to be joined is nonuniformly distributed, the load on the computing nodes in a MapReduce environment becomes imbalanced, which reduces the efficiency of join operations and the performance of large-scale data analysis. Second, although multi-core processors are now standard in computing clusters, MapReduce does not fully exploit thread-level parallelism, which leaves further room to optimize join operations.
This thesis studies the performance optimization of join queries based on MapReduce. It aims to reduce the impact of nonuniform data distribution and to exploit the fine-grained parallel computing capability of multi-core processors, so as to realize the potential of the computing platform, improve join performance, and provide operator-level optimization solutions for large-scale data analysis. The main research problems and results are as follows.

First, we surveyed and experimentally analyzed typical join query algorithms in a traditional MapReduce environment. We implemented these algorithms on a unified experimental platform and designed experiments to compare them on the same datasets from different perspectives. The results show that the improved repartitioning join algorithm has better running time and stability under the traditional MapReduce computing architecture.

Second, to address the load imbalance caused by nonuniform data distribution, we designed and implemented an optimization strategy that integrates combination-segmentation and equilibrium partitioning, and incorporated it into the join query algorithm under an improved MapReduce computing architecture. The method applies combination partitioning to lightly skewed data groups and segmentation partitioning to heavily skewed data groups, which improves the performance of the improved repartitioning join algorithm on unevenly distributed data.
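The idea of combining lightly skewed groups and segmenting heavily skewed ones can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function name `plan_partitions`, the use of the average reducer load as the heavy/light threshold, and the greedy least-loaded assignment are all assumptions made for illustration.

```python
import heapq


def plan_partitions(key_counts, num_reducers):
    """Return {reducer_id: [(key, n_records), ...]} with roughly equal loads.

    key_counts maps each join key to its record count (for example from
    sampling). The threshold and the greedy policy are illustrative
    assumptions, not taken from the thesis.
    """
    total = sum(key_counts.values())
    target = -(-total // num_reducers)  # ceiling of the average load
    plan = {r: [] for r in range(num_reducers)}
    heap = [(0, r) for r in range(num_reducers)]  # min-heap of (load, reducer)
    heapq.heapify(heap)

    def assign(key, n):
        # Give n records of this key to the currently least-loaded reducer.
        load, r = heapq.heappop(heap)
        plan[r].append((key, n))
        heapq.heappush(heap, (load + n, r))

    # Process the largest groups first so small groups fill the gaps.
    for key, n in sorted(key_counts.items(), key=lambda kv: -kv[1]):
        if n > target:
            # Segmentation: cut a heavily skewed group into chunks.
            remaining = n
            while remaining > 0:
                chunk = min(target, remaining)
                assign(key, chunk)
                remaining -= chunk
        else:
            # Combination: pack a lightly skewed group with others.
            assign(key, n)
    return plan
```

With `{"hot": 90, "k1": 10, "k2": 10, "k3": 10}` and three reducers, the hot key is cut into chunks of 40, 40, and 10, and the three light keys are combined with the last chunk. In a real join, the records of the other table that match a segmented key would also have to be sent to every reducer holding one of its chunks; that step is omitted here.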
Experimental results show that the strategy integrating combination-segmentation and equilibrium partitioning effectively improves the performance of join queries over unevenly distributed data under MapReduce, with good running time and scalability.

Finally, to exploit the fine-grained parallelism offered by multi-core architectures, we designed and implemented an optimization strategy of non-competitive data fragment input, and on that basis proposed a join query algorithm that uses multi-threaded processing in the Map phase under an improved MapReduce computing architecture. The method splits the Map-phase input into equal-size fragments and feeds them to multiple threads without contention, which takes full advantage of thread-level parallelism and further improves the improved repartitioning join algorithm. Extensive experiments show that the proposed strategy makes full use of the fine-grained parallelism of multi-core processors and improves the efficiency of the MapReduce-based join query algorithm, with better running time and scalability.
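The multi-threaded Map phase with non-competitive fragment input can be sketched roughly as follows. The helper names (`equal_fragments`, `map_fragment`, `threaded_map`) and the tuple record layout are hypothetical, and the tagging of each record with its source table follows the usual repartition-join convention rather than anything stated in the abstract. Each thread receives its own fragment up front, so no thread competes for a shared input queue.

```python
from concurrent.futures import ThreadPoolExecutor


def equal_fragments(records, num_threads):
    """Split records into contiguous fragments whose sizes differ by at most one."""
    base, extra = divmod(len(records), num_threads)
    fragments, start = [], 0
    for i in range(num_threads):
        size = base + (1 if i < extra else 0)
        fragments.append(records[start:start + size])
        start += size
    return fragments


def map_fragment(fragment, table_tag, key_index):
    """Map function for a repartition join: emit (join_key, (table_tag, record))."""
    return [(rec[key_index], (table_tag, rec)) for rec in fragment]


def threaded_map(records, table_tag, key_index, num_threads=4):
    """Run the map function over private fragments, one fragment per thread."""
    fragments = equal_fragments(records, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Each task reads only its own fragment and writes only its own list,
        # so the threads share no mutable input or output state.
        outputs = pool.map(
            map_fragment,
            fragments,
            [table_tag] * num_threads,
            [key_index] * num_threads,
        )
        return [pair for out in outputs for pair in out]
```

This shows only the structure of the approach. In standard CPython the global interpreter lock prevents CPU-bound threads from running in parallel, so a sketch like this would not reproduce the speedups reported for a multi-core MapReduce implementation.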

Keywords/Search Tags:

Join Query, MapReduce, Data Skew, Multi-core Processor

Related items

1	Research On Data Skew In Join Based On Hadoop
2	Research On Optimization For Multi-way Join In A Map-Reduce Environment
3	Research And Implementation Of Multi-Way Join Framework Based On Map-Reduce
4	Research And Implementation Of Skew Join Optimization Technology On MyCat
5	Optimization And Implementation Of Parallel Join Algorithm
6	Research Of Join Algorithm With Skew Data On Mapreduce
7	Optimization And Research On Reduce Task Scheduling Strategy And Data Skew On Hadoop
8	Research On Key Technology Of Multi-Core Processor
9	The Optimization Of Hash Join Algorithm Based On KNL
10	Research On Some Key Technologies Of Parallel Processing For Big Data Based On Map Reduce