Research And Implementation Of Multi-Way Join Framework Based On Map-Reduce

Posted on: 2013-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Wang
GTID: 2298330467974711
Subject: Computer application technology
Abstract/Summary:
In recent years, with the surge in data volume and the rise of the concept of big data, the processing and analysis of big data has attracted wide attention across diverse research fields. Following Google's success in data processing, MapReduce has become the most widely used data processing framework. Hadoop MapReduce, the open source community version of Google's MapReduce framework, has become one of the most popular open source projects. The MapReduce framework divides large data sets into small splits and processes them in parallel. It also shields users from the details of parallel programming, parallel coordination, and fault tolerance in parallel processing.

The join is a basic operation of information analysis, and join processing on MapReduce has long been a research focus. However, the MapReduce framework is not well suited to join processing, and MapReduce-based join operations suffer from many problems. In this paper, we propose the Share-Coordinate-MapReduce (SCMapReduce) framework, which focuses on two problems: the redundant intermediate data produced by the standard framework, and the coordination of multiple tasks when processing multi-way joins.

Firstly, by analyzing the causes of redundant data in multi-way joins, we propose the Share-MapReduce framework, which is based on a mechanism for sharing Bloom Filter information. To filter intermediate tuples and so reduce network data transmission and the I/O processing burden, we join the data sets in an arranged order and share the join attribute information. To solve the problem of coordinating multiple tasks, we propose the Coordinate-MapReduce framework, which coordinates the workflow of different tasks and reduces the waiting time between tasks. We also specify the join order that maximizes the performance of the framework.

Secondly, because the improved framework increases the workload on the master node and raises the probability of a single point of failure, we put forward a deployment strategy based on virtualization technology. With the virtualization framework, we can monitor the operating environment dynamically and resolve problems in advance using a polling strategy. To ensure the performance of the improved framework, we analyse its reliability and scalability. To guide its use, we give a theoretical analysis of the trade-offs of the framework.

At the end of this paper, we test the performance of the SCMapReduce framework with manually generated network log files. The experiments show that the improved framework supports multi-way joins well, especially for sparse tables. In addition, through the management of the virtualization framework, the master node can maintain a favorable operating environment...
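The idea of sharing Bloom Filter information to drop non-joining tuples before they are shuffled can be illustrated with a minimal sketch. This is not the thesis's implementation: the `BloomFilter` class, the `build_filter` and `map_side_filter` helpers, the filter size, and the sample tables are all hypothetical, and the code runs in a single process rather than on Hadoop. It only shows the general technique, in which a filter built from the join keys of one data set is shared with the map side of another so that tuples whose keys cannot match are discarded early.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter (hypothetical sizes, for illustration only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position keeps the sketch simple.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def build_filter(rows, key_index):
    """Build a filter over the join attribute of one data set."""
    bloom = BloomFilter()
    for row in rows:
        bloom.add(row[key_index])
    return bloom


def map_side_filter(rows, key_index, shared_filter):
    """Keep only tuples whose join key may appear in the other data set."""
    return [row for row in rows if shared_filter.might_contain(row[key_index])]


# Hypothetical sample data: users joined with page visits on the user id.
users = [("u1", "alice"), ("u2", "bob")]
visits = [("u1", "/home"), ("u3", "/about"), ("u2", "/cart"), ("u4", "/help")]

shared = build_filter(users, 0)
survivors = map_side_filter(visits, 0, shared)
```

A Bloom filter never produces false negatives, so every tuple that can join survives the filter. False positives are possible, so a few non-joining tuples may still be passed on and must be eliminated by the actual join.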
Keywords/Search Tags: Map-Reduce, join algorithm, share framework, Bloom Filter