
The Research And Implementation Of Spark And NoSQL Databases Integration

Posted on: 2017-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Li
Full Text: PDF
GTID: 2428330569998546
Subject: Computer Science and Technology
Abstract/Summary:
The explosive growth of data across industries poses an unprecedented challenge to traditional data computing and storage technology. As an in-memory computing engine, Spark supports several computing paradigms, including stream processing, machine learning, graph mining, and structured queries, and is increasingly chosen by organizations as their data processing platform. NoSQL databases, with their flexible data models and excellent read/write performance, are widely used for massive, diversified enterprise storage. However, Spark does not provide the same friendly computing support for these databases as it does for HDFS. Industry practice offers only a handful of one-to-one solutions for applying Spark to data in distributed stores, so when users face new processing requirements that existing systems cannot meet, they must choose a suitable store and devise a new solution themselves, which raises the difficulty of building such applications.

Integrating Spark with NoSQL databases requires solving two problems: (1) how to enable Spark to process NoSQL datasets in a distributed, parallel way; (2) how to enable Spark to compute over NoSQL datasets with as much data locality as possible. To address these problems, this thesis proposes an integration framework for Spark and NoSQL databases that supports locality-aware parallel processing of NoSQL datasets. The main content and contributions of this dissertation are summarized as follows:

1. Based on the mechanisms of Spark and NoSQL databases, an integration framework is proposed that supports Spark processing NoSQL datasets in parallel with data locality. This work mainly includes: 1) a deep analysis of how data analysis engines such as Spark and Hadoop process data stored in NoSQL databases such as Cassandra and HBase; 2) a study of the design of distributed data-source RDDs, giving the data slicing principle and the methods for computing data locality and slice boundaries when a NoSQL dataset is abstracted as an RDD; 3) interface specifications for the integration framework; 4) two co-located integration deployment architectures.

2. A reference implementation of Spark and HBase integration is given, based on the proposed framework and the HBase storage model. This work mainly includes: 1) the transformation of HBase datasets into RDD data structures; 2) a computing support module for HBase added to the Spark source code.

3. Tests and analysis of the implementation show that: 1) Spark can compute HBase datasets in a distributed, parallel, locality-aware way; 2) the performance of the framework-based integration is significantly better than the general approach; 3) pushing Select/Project requests down to the HBase servers fetches data faster than Spark's own client-side filtering. The experimental results thus validate the effectiveness of the proposed integration framework.
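The slicing and locality calculation described in contribution 1.2 can be sketched as follows. This is a minimal Python model of the idea, not the thesis's implementation: the names `Shard`, `Partition`, `make_partitions`, and `schedule` are all illustrative. Each store-side shard (e.g. an HBase region) maps to one RDD partition, and the shard's hosting servers become the partition's preferred locations, mirroring the contract Spark's `RDD.getPreferredLocations` exposes to the scheduler.

```python
# Minimal model of locality-aware slicing: one partition per storage
# shard, with the shard's hosts carried along as locality hints for
# the scheduler. All names here are illustrative.

from dataclasses import dataclass

@dataclass
class Shard:
    start_key: str          # inclusive start of the shard's key range
    end_key: str            # exclusive end of the shard's key range
    hosts: list             # servers holding a replica of this shard

@dataclass
class Partition:
    index: int
    start_key: str
    end_key: str
    preferred_hosts: list   # where the scheduler should try to run the task

def make_partitions(shards):
    """One partition per shard, carrying the shard's hosts as locality hints."""
    return [
        Partition(i, s.start_key, s.end_key, list(s.hosts))
        for i, s in enumerate(shards)
    ]

def schedule(partition, free_executors):
    """Prefer an executor on a host that stores the data; else fall back."""
    for host in partition.preferred_hosts:
        if host in free_executors:
            return host            # node-local task: no network read needed
    return free_executors[0]       # remote read as a last resort

shards = [
    Shard("a", "m", ["node1", "node2"]),
    Shard("m", "z", ["node3"]),
]
parts = make_partitions(shards)
print(schedule(parts[0], ["node2", "node3"]))  # node2 (data-local)
print(schedule(parts[1], ["node1"]))           # node1 (remote fallback)
```

The design choice this models is the one the abstract relies on: because partition boundaries coincide with shard boundaries, a task scheduled on a preferred host reads its slice from local disk instead of over the network.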
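Result 3.3, that pushing Select/Project requests down to the HBase servers beats Spark's own filtering, can be illustrated with a small model. This is a hypothetical sketch, not the thesis's code: the row counts stand in for network traffic, and the point is only that a server-side predicate shrinks what crosses the wire while producing the same answer.

```python
# Toy model of predicate pushdown: filtering on the "server" before
# transfer vs. shipping every row and filtering in Spark afterwards.
# TABLE and both scan functions are illustrative names.

TABLE = [{"key": i, "cf:val": i % 10} for i in range(1000)]

def scan_with_pushdown(predicate):
    """Server-side filter: only matching rows are transferred."""
    transferred = [row for row in TABLE if predicate(row)]
    return transferred, len(transferred)   # rows sent over the network

def scan_then_filter(predicate):
    """Client-side filter: every row is transferred, then filtered."""
    transferred = list(TABLE)              # full scan crosses the network
    return [r for r in transferred if predicate(r)], len(transferred)

pred = lambda r: r["cf:val"] == 3
rows_pd, sent_pd = scan_with_pushdown(pred)
rows_cl, sent_cl = scan_then_filter(pred)
assert rows_pd == rows_cl                  # same answer either way
print(sent_pd, sent_cl)                    # 100 vs 1000 rows on the wire
```

In the real integration this corresponds to evaluating the predicate inside the region servers (e.g. via HBase scan filters) instead of materializing the full table as an RDD and applying `filter` in Spark.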
Keywords/Search Tags: Spark, NoSQL Database, Integration, Massively Parallel, Data Locality