Research And Implementation Of Calculation Results Reusing Strategy Based On Distributed Computing Clusters

Posted on: 2016-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: H Xie
GTID: 2298330452466404
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet and of enterprise information technology, data are being generated and accumulated at an exponential rate. How to manage and use huge amounts of data will have a significant impact on business decisions and will be an important part of driving value growth. MapReduce is a parallel processing model for large-scale clusters, and it is becoming more and more popular for massive data processing.

Traditional data warehouses cannot handle TB-level data within an acceptable time, so Hive, a data warehouse tool based on MapReduce, has been widely adopted. Because the Hive parser converts queries into MapReduce workflows, and serial jobs in a MapReduce workflow must write interim results to HDFS for the next job to read, a large amount of I/O is incurred. The time spent on job startup and cleanup also reduces the efficiency of data processing significantly. Moreover, similar queries cannot share calculation results, which wastes computing resources. To solve this problem, this paper focuses on how to reuse the calculation results of MapReduce workflows.

1. This paper introduces the background and related work of MapReduce and explains the importance of reusing calculation results in massive data scenarios. We then analyze the existing research and summarize its characteristics and shortcomings. We also introduce HDFS, the distributed file system used for data storage, and analyze the implementation of the MapReduce model. On this basis, we discuss the advantages of Hive and give a brief overview of HiveQL grammar.

2. Building on step 1, this paper describes the abstract syntax tree and the stage dependencies, which are generated by the Hive parser. We analyze the process and principles of join and discuss the feasibility of reusing calculation results in Hive.

3. Subsequently, this paper details the reuse strategy. We define joint-object, joint-graph, sub-joint-object, reuse-joint-graph and other data structures to describe calculation results, and propose an algorithm to extract the characteristics of calculation results. On this basis, we design and implement a calculation results matching algorithm, which can be used for single and multiple joint-objects. If more than one result is available, the optimization algorithm generates the best reuse solution, based on the number of jobs and the product of the record counts of all joined tables. We then analyze the time and space complexity of our strategy in detail. To increase the probability of reuse, we propose three methods (extra key selection, delay of arithmetic, and semantic understanding) and analyze their extra cost by experiment. We also propose a data management method based on the time consumption of jobs, the number of times a result has been reused, how well it satisfies recent queries, and other factors.

4. This paper verifies the validity of the strategy through detailed experiments. We use two benchmarks to analyze the effectiveness of single and multiple joint-object reuse, and examine the impact on initial queries. Through a series of comparative experiments, we verify that the strategy proposed in this paper not only improves computational efficiency but also has little impact on the efficiency of initial queries.
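As a rough illustration of the selection step in point 3, the following Python sketch chooses among several stored results that could serve a new query. It is not the thesis's algorithm. The `CachedResult` class, the subset-of-tables matching rule, and the ranking (most jobs saved first, then the largest product of record counts) are all assumptions made for illustration, and the table names and row counts are invented.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class CachedResult:
    """A stored intermediate result (hypothetical stand-in for a joint-object)."""
    name: str
    tables: frozenset   # tables joined to produce this result
    jobs_saved: int     # MapReduce jobs skipped if this result is reused


def matches(result, query_tables):
    """Assumed rule: a result is usable only if it joins a subset of the query's tables."""
    return result.tables <= query_tables


def cost_proxy(result, row_counts):
    """Product of the record counts of the joined tables (assumed proxy for join cost)."""
    return prod(row_counts[t] for t in result.tables)


def best_reuse(candidates, query_tables, row_counts):
    """Return the usable result that saves the most jobs, breaking ties by cost proxy."""
    usable = [c for c in candidates if matches(c, query_tables)]
    if not usable:
        return None
    return max(usable, key=lambda c: (c.jobs_saved, cost_proxy(c, row_counts)))


# Invented example data: three cached two-table joins and one three-table query.
row_counts = {"orders": 1000, "customers": 100, "items": 5000, "suppliers": 50}
candidates = [
    CachedResult("orders_customers", frozenset({"orders", "customers"}), 1),
    CachedResult("orders_items", frozenset({"orders", "items"}), 1),
    CachedResult("orders_suppliers", frozenset({"orders", "suppliers"}), 1),
]
query_tables = frozenset({"orders", "customers", "items"})

best = best_reuse(candidates, query_tables, row_counts)
print(best.name)  # orders_items
```

Here `orders_suppliers` is excluded because the query does not touch `suppliers`, and the two remaining candidates tie on jobs saved, so the one covering the larger join wins. A real matcher would also have to compare join keys, filters, and projected columns, which this sketch ignores.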
Keywords/Search Tags: MapReduce, Hive, calculation results reuse, Join-Object, data management