Sharing Query Results In MapReduce Framework

Posted on:2019-05-20

Degree:Master

Type:Thesis

Country:China

Candidate:L Shi

Full Text:PDF

GTID:2348330569479550

Subject:Computer Science and Technology

Abstract/Summary:

Because of redundancy in the MapReduce framework, the outputs of the Map stage and the Reduce stage must interact with HDFS multiple times, causing I/O overhead. Because queries overlap, a query has a high probability of being executed repeatedly, resulting in CPU overhead. Reusing existing query results can significantly improve the efficiency of the system. To be reused, query results must be stored and matched. To ensure that the first match is also the best match, the query results in the repository must be well organized according to the characteristics of the queries and the relationships between the query results, which inevitably incurs additional system overhead. Matching the query to be executed against a query result in the repository incurs further overhead. According to the experimental analysis, this additional system overhead accounts for 30% of the overall cost of executing a query.

To address these issues, this thesis studies approaches that minimize these overheads in order to improve the performance of query execution under the MapReduce framework. To reduce the overhead of managing the result repository, it proposes a forest structure for organizing the results of MapReduce jobs. By exploiting the characteristics of the forest structure, the number of matches can be greatly reduced without affecting the matching results, thus reducing system overhead. To reduce the overhead caused by the matching operation, it proposes a matching algorithm adapted to the forest repository. The result of matching the query to be executed against a root node determines whether the query needs to be matched against that node's children, which reduces the number of matches. The matching algorithm uses the bad-character shift and the good-suffix shift of the Boyer-Moore algorithm to improve matching efficiency.

Building on the above solution, and so that the system can fully reuse the results of executed queries, this thesis proposes a scheme for preprocessing multiple queries. The scheme changes the order in which queries enter the Pig compiler, and therefore the order in which jobs execute, so that jobs that load the same dataset are executed simultaneously, reducing the number of matches against the repository.

Finally, this thesis designs and develops a reuse system based on the MapReduce framework and deploys it on a Hadoop cluster. The test data is generated with the PigMix benchmark, and the reuse system is compared with ReStore, the state-of-the-art reuse system. Experimental results show that, compared with ReStore, the proposed approaches significantly reduce system overhead and improve query execution performance.
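The root-first pruning that the abstract attributes to the forest repository can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration, since the abstract does not describe the actual data structures: a stored result is keyed by a tuple of operator strings, a stored plan "matches" a query when it is a prefix of the query's plan, and the names `ResultNode`, `insert`, and `best_match` are hypothetical. The Boyer-Moore shifts mentioned in the abstract are not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class ResultNode:
    """A stored job result, keyed by the operator sequence that produced it."""
    plan: tuple   # assumed plan encoding, e.g. ("LOAD users", "FILTER age > 18")
    path: str     # hypothetical location of the stored output
    children: list = field(default_factory=list)


def is_prefix(stored, query):
    """True if the stored plan is a leading part of the query plan."""
    return len(stored) <= len(query) and query[:len(stored)] == stored


def insert(roots, node):
    """Place a node under the deepest stored plan that it extends.

    Simplification: existing nodes are never re-parented.
    """
    siblings = roots
    while True:
        parent = next((n for n in siblings if is_prefix(n.plan, node.plan)), None)
        if parent is None:
            siblings.append(node)   # no sibling covers it: new root or new child
            return
        siblings = parent.children  # descend into the covering subtree


def best_match(roots, query):
    """Return (node, comparisons) for the longest stored prefix of the query.

    A tree whose root does not match is skipped after one comparison;
    children are examined only below a node that matched.
    """
    best, comparisons, siblings = None, 0, roots
    while siblings:
        hit = None
        for n in siblings:
            comparisons += 1
            if is_prefix(n.plan, query):
                hit = n
                break
        if hit is None:
            break
        best, siblings = hit, hit.children
    return best, comparisons
```

In this sketch, a query over one dataset never touches the subtree rooted at a plan that loads a different dataset: the mismatch at the root rules out every descendant, which is the kind of saving in match count that the abstract describes.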

Keywords/Search Tags:

MapReduce framework, forest structure, overhead, preprocessing multiple queries

Related items

1	Optimizing RDF Analytical Queries on MapReduce
2	The Research Of Skyline Queries Algorithms Based On MapReduce
3	Skyline Query Research For Massive RDF Data Under Distributed Computing Environments
4	A Multiple-Queries Processing Technique On Ziv-Lempel Compressed Texts
5	Optimizing multiple continuous queries
6	Research Of Distributed Network Crawler Based On MapReduce Framework
7	Research On CAD/CAE Analysis And Design System Of General Overhead Crane Structure
8	Parallel Ordinal Decision Tree And Decision Forest Based On MapReduce
9	Research On Key Issues Of Task And Job Scheduling For MapReduce Clusters
10	Research On Overhead Controllable Runtime Verification Framework Based On Predictive