Join Method Research Based On MapReduce

Posted on:2015-02-23

Degree:Master

Type:Thesis

Country:China

Candidate:Q K Guo

Full Text:PDF

GTID:2268330428490983

Subject:Computer software and theory

Abstract/Summary:

The rapid development of networking and cloud computing has led to explosive growth in global data, figuratively called Massive Data or Big Data. The value hidden behind this data has grown at the same time: it can provide decision support and business opportunities for the enterprises that own the data, and it can also support more convenient, intelligent, and efficient services. Big data includes more data types and more complex data structures. A variety of structured, semi-structured, and unstructured data is generated by a wide range of application environments. Humanity is entering the era of big data.

Against this background, the value of data has received unprecedented attention, and more and more people are turning to big data analysis and processing. Traditional relational data management and analysis technology and parallel computing technology cannot meet the challenges brought by big data because of their inherent limitations. New theories and technologies are therefore needed to support big data analysis and processing.

As a representative programming model for data-intensive computing, MapReduce plays an irreplaceable role in big data analysis and processing because of its good scalability, high fault tolerance, and low cost. However, MapReduce does not directly support the join operation, which makes it difficult to analyze and process relational data. Join is one of the basic operations of relational algebra and a fundamental means of relational data analysis and processing.

Most existing MapReduce-based join methods consider only equi-joins. A simple equi-join, however, cannot complete in-depth analysis jobs; more complicated join types such as theta-join and cross product are also needed. The few studies that address theta-join either lack a detailed description, are difficult to understand and implement, or cannot adapt to a changing computing environment.

For these reasons, this thesis proposes a simple and effective method for processing theta-joins using MapReduce. It is simple in that it is easy to understand and described in detail; it is effective in that it can set the number of Reducers according to the input, adapting to a changing computing environment. The method is called Adaptive Share MapReduce Theta (ASMRT), that is, a MapReduce-based adaptive-share theta-join algorithm. It consists of two parts, MapReduce Theta (MRT) and Adaptive Share (AS). The AS algorithm calculates the share of every dataset and the number of Reducers according to the cardinality of every dataset. The MRT algorithm processes the theta-join according to those shares and that number of Reducers. The theoretical model of the MRT algorithm, the MRT model, partitions the data logically using a variable that is unrelated to the joined records. This not only lets MapReduce's partitioning logic handle theta-joins with arbitrary conditions, making theta-join processing on MapReduce possible, but also avoids the data skew caused by the naturally uneven distribution of keys in the datasets. To illustrate the feasibility and effectiveness of the proposed algorithm, this thesis implements ASMRT, analyzes the execution process of MRT from the perspective of relational algebra, and analyzes AS using representative examples. The results show that the algorithm can use a single MapReduce job to process multi-way theta-joins with arbitrary conditions simply and efficiently.
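
The abstract does not give the details of AS or MRT, so the following is only a minimal single-process sketch of the general idea it describes, restricted to two datasets: shares are chosen from the dataset cardinalities, and records are partitioned by a random value that is independent of their content. The function names `choose_shares` and `theta_join`, the cost formula, and the grid layout of Reducers are all assumptions made for illustration and are not taken from the thesis.

```python
import random
from collections import defaultdict
from itertools import product


def choose_shares(card_r, card_s, max_reducers):
    """Hypothetical stand-in for the AS step (not the thesis's formula).

    Picks a rows x cols grid of reducers with rows * cols <= max_reducers
    that minimises the per-reducer input card_r / rows + card_s / cols.
    """
    best = None
    for rows in range(1, max_reducers + 1):
        cols = max_reducers // rows
        cost = card_r / rows + card_s / cols
        if best is None or cost < best[0]:
            best = (cost, rows, cols)
    return best[1], best[2]


def theta_join(r, s, theta, max_reducers, seed=0):
    """Hypothetical stand-in for the MRT step, simulated in one process."""
    rng = random.Random(seed)
    rows, cols = choose_shares(len(r), len(s), max_reducers)
    # Each bucket models one reducer and holds (records from r, records from s).
    buckets = defaultdict(lambda: ([], []))

    # Map phase: the partition key is a random index, unrelated to the
    # record's join attributes, so skewed key values cannot overload a reducer.
    for rec in r:
        i = rng.randrange(rows)
        for j in range(cols):  # replicate across the chosen row
            buckets[(i, j)][0].append(rec)
    for rec in s:
        j = rng.randrange(cols)
        for i in range(rows):  # replicate across the chosen column
            buckets[(i, j)][1].append(rec)

    # Reduce phase: every (r, s) pair meets in exactly one reducer,
    # so any predicate can be checked there without duplicate output.
    out = []
    for left, right in buckets.values():
        out.extend((a, b) for a, b in product(left, right) if theta(a, b))
    return out


if __name__ == "__main__":
    r = list(range(10))
    s = list(range(5, 15))
    pairs = theta_join(r, s, lambda a, b: a < b, max_reducers=4)
    print(len(pairs))
```

Because a record from the first dataset is sent to one full row of the grid and a record from the second to one full column, each candidate pair is compared exactly once, which is what allows an arbitrary join condition to be evaluated. A multi-way join, which the abstract also covers, would need one grid dimension per dataset and is not shown here.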

Keywords/Search Tags:

Big Data, MapReduce, Theta-Join, Partition, Cloud Computing

Related items

1	Research And Design Of KNN-join Algorithm Based On MapReduce
2	Optimizing Multi-Join In Cloud Environment
3	Research And Implementation Of The Big Spatial Data Join Query Processing Algorithms In Cloud Environment
4	Research Of Join Algorithm With Skew Data On MapReduce
5	Join Processing And Optimizing On Large Data Sets Based On Hadoop Framework
6	The Research And Implementation Of Comprehensive MapReduce
7	Research On Complex Distance Measure Based MapReduce Similarity Join Techniques
8	Research On Partition Selection Strategy For Big Data Management Based On KNN Connection Processing
9	Efficient SPARQL Theta Join Processing On Large Scale RDF Graphs
10	Design And Implementation Of Similarity Self-Connection Algorithm For Massive Data Sets Based On MapReduce