Research On Data Storage And Query On Big Data Environment

Posted on:2015-12-21

Degree:Master

Type:Thesis

Country:China

Candidate:L Li

GTID:2298330422991916

Subject:Computer technology

Abstract/Summary:

In the era of big data, many simple database operations become impractical as data volumes grow. How to handle and analyze such massive data has become a challenging problem. To address it, researchers have proposed many novel methods and models for data storage, data transportation, and data analysis. Among these tools, Hadoop and MapReduce are the most popular for big data storage and analysis, and they have been adopted by many industrial companies and academic researchers. Although MapReduce solves some of these problems, it is still unsuitable for many scenarios, so new methods are still needed.

We explore storage and query methods for big data, relying mainly on the CMD storage method. The traditional CMD storage method, which is designed for a stand-alone multi-disk environment, no longer meets the challenges the database community faces. For the first time, we extend CMD to a distributed, parallel environment: we propose a CMD storage method for clusters and use it to solve the multi-way theta-join problem on large data volumes, design a brand-new graph data storage model, and adapt CMD so that it can store high-dimensional data and be deployed on large clusters.

For ordinary relational data, we propose a new multi-way theta-join algorithm based on the CMD storage method and compare its efficiency with that of a traditional relational database and the Hadoop distributed computing environment. Because this algorithm makes full use of the index inherent in CMD, it is much faster than the corresponding algorithms on traditional relational databases and Hadoop, and it can serve as an efficient solution for multi-way theta-join queries on big data.

For graph data, we adapt the data to fit the CMD storage method, explore effectiveness and efficiency, and finally propose a graph model on CMD together with some basic operations. This is a brand-new graph data model that, compared with previous graph data models, focuses more on edges than on vertices. It can improve the efficiency of queries that mainly deal with edges.

For the difficulties CMD faces with high-dimensional data and large clusters, we propose some improvements to classical CMD storage. The attribute group notion we propose divides the attributes into groups, which addresses the large number of fragments produced when CMD stores high-dimensional data. The cluster group notion we propose addresses the scattered data fragments and the congested network communication that arise when CMD is deployed on a large cluster.

Keywords/Search Tags:

CMD, Multi-way Theta-join, graph model, distributed environment

Related items

1	Efficient SPARQL Theta Join Processing On Large Scale RDF Graphs
2	Hadoop Based Efficient Join Algorithm Research On GPU
3	Join Method Research Based On MapReduce
4	Research On Optimization For Multi-way Join In A Map-Reduce Environment
5	Research On The Filtering Problem Of The θ Join Between Multi-way Data Streams
6	Research And Implementation Of Structural Join In XML Data Graph
7	Distributed Query System For Large Scale Knowledge Graph
8	A Study Of Multi-join Query Optimization Algorithm In Distributed Database
9	Distributed Database Multi-join Query Optimization Algorithm
10	Research On Join Query Optimization Algorithm In Distributed Database