
The Research And Implementation Of Spark And NoSQL Databases Integration

Posted on: 2017-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Li
Full Text: PDF
GTID: 2428330569998546
Subject: Computer Science and Technology
Abstract/Summary:
The explosive growth of data across industries poses an unprecedented challenge to traditional data computing and storage technology. As an in-memory computing engine, Spark supports several computing paradigms, including stream processing, machine learning, graph mining, and structured queries, and is increasingly chosen by organizations as their data processing platform. NoSQL databases, with their flexible data models and excellent read/write performance, are widely used for massive, diversified enterprise storage. However, Spark does not provide the same friendly computing support for these databases as it does for HDFS. Industry practice offers only a handful of one-to-one solutions for applying Spark to data in distributed stores, so when users face new processing requirements that existing systems cannot meet, they must choose a suitable store and devise a new solution themselves, which raises the difficulty of building such applications.

Integrating Spark with NoSQL databases requires solving two problems: (1) how to enable Spark to process NoSQL datasets in a distributed, parallel way; (2) how to enable Spark to compute over NoSQL datasets with as much data locality as possible. To address these problems, this thesis proposes an integration framework for Spark and NoSQL databases that supports locality-aware parallel processing of NoSQL datasets. The main content and contributions of this dissertation are summarized as follows:

1. Based on the mechanisms of Spark and NoSQL databases, an integration framework is proposed that supports Spark processing NoSQL datasets in parallel with data locality. This work mainly includes: 1) a deep analysis of how data analysis engines such as Spark and Hadoop process data stored in NoSQL databases such as Cassandra and HBase; 2) a study of the design of distributed data-source RDDs, giving the data slicing principle and the methods for computing data locality and slice boundaries when a NoSQL dataset is abstracted as an RDD; 3) interface specifications for the integration framework; 4) two co-located integration deployment architectures.

2. A reference implementation of Spark and HBase integration is given, based on the proposed framework and the HBase storage model. This work mainly includes: 1) the transformation of HBase datasets into RDD data structures; 2) a computing support module for HBase added to the Spark source code.

3. Tests and analysis of the implementation show that: 1) Spark can compute HBase datasets in a distributed, parallel, locality-aware way; 2) the performance of the framework-based integration is significantly better than the general approach; 3) pushing Select/Project requests down to the HBase servers fetches data faster than Spark's own client-side filtering. The experimental results thus validate the effectiveness of the proposed integration framework.
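The slicing and locality calculation described in contribution 1.2 can be sketched as follows. This is a minimal Python model of the idea, not the thesis's implementation: the names `Shard`, `Partition`, `make_partitions`, and `schedule` are all illustrative. Each store-side shard (e.g. an HBase region) maps to one RDD partition, and the shard's hosting servers become the partition's preferred locations, mirroring the contract Spark's `RDD.getPreferredLocations` exposes to the scheduler.

```python
# Minimal model of locality-aware slicing: one partition per storage
# shard, with the shard's hosts carried along as locality hints for
# the scheduler. All names here are illustrative.

from dataclasses import dataclass

@dataclass
class Shard:
    start_key: str          # inclusive start of the shard's key range
    end_key: str            # exclusive end of the shard's key range
    hosts: list             # servers holding a replica of this shard

@dataclass
class Partition:
    index: int
    start_key: str
    end_key: str
    preferred_hosts: list   # where the scheduler should try to run the task

def make_partitions(shards):
    """One partition per shard, carrying the shard's hosts as locality hints."""
    return [
        Partition(i, s.start_key, s.end_key, list(s.hosts))
        for i, s in enumerate(shards)
    ]

def schedule(partition, free_executors):
    """Prefer an executor on a host that stores the data; else fall back."""
    for host in partition.preferred_hosts:
        if host in free_executors:
            return host            # node-local task: no network read needed
    return free_executors[0]       # remote read as a last resort

shards = [
    Shard("a", "m", ["node1", "node2"]),
    Shard("m", "z", ["node3"]),
]
parts = make_partitions(shards)
print(schedule(parts[0], ["node2", "node3"]))  # node2 (data-local)
print(schedule(parts[1], ["node1"]))           # node1 (remote fallback)
```

The design choice this models is the one the abstract relies on: because partition boundaries coincide with shard boundaries, a task scheduled on a preferred host reads its slice from local disk instead of over the network.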
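Result 3.3, that pushing Select/Project requests down to the HBase servers beats Spark's own filtering, can be illustrated with a small model. This is a hypothetical sketch, not the thesis's code: the row counts stand in for network traffic, and the point is only that a server-side predicate shrinks what crosses the wire while producing the same answer.

```python
# Toy model of predicate pushdown: filtering on the "server" before
# transfer vs. shipping every row and filtering in Spark afterwards.
# TABLE and both scan functions are illustrative names.

TABLE = [{"key": i, "cf:val": i % 10} for i in range(1000)]

def scan_with_pushdown(predicate):
    """Server-side filter: only matching rows are transferred."""
    transferred = [row for row in TABLE if predicate(row)]
    return transferred, len(transferred)   # rows sent over the network

def scan_then_filter(predicate):
    """Client-side filter: every row is transferred, then filtered."""
    transferred = list(TABLE)              # full scan crosses the network
    return [r for r in transferred if predicate(r)], len(transferred)

pred = lambda r: r["cf:val"] == 3
rows_pd, sent_pd = scan_with_pushdown(pred)
rows_cl, sent_cl = scan_then_filter(pred)
assert rows_pd == rows_cl                  # same answer either way
print(sent_pd, sent_cl)                    # 100 vs 1000 rows on the wire
```

In the real integration this corresponds to evaluating the predicate inside the region servers (e.g. via HBase scan filters) instead of materializing the full table as an RDD and applying `filter` in Spark.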
Keywords/Search Tags: Spark, NoSQL Database, Integration, Massively Parallel, Data Locality