Research On Storage And Query Of Massive RDF Data

Posted on:2017-11-13

Degree:Master

Type:Thesis

Country:China

Candidate:L S Yu

Full Text:PDF

GTID:2428330566453014

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of Semantic Web technology and its wide application in various fields, RDF datasets are growing ever larger. Traditional centralized RDF storage systems are facing a bottleneck in the scalability and performance of query processing. How to build a distributed RDF storage system that is highly scalable and processes SPARQL queries efficiently has become a hot topic in the Semantic Web field. However, existing distributed RDF storage systems still have shortcomings. For example, systems based on HDFS perform poorly when querying RDF data, because HDFS lacks an efficient index and MapReduce introduces high latency when processing SPARQL queries. To address these problems, we propose to store RDF data in Cassandra and to implement SPARQL queries with the Spark parallel computing framework. This thesis addresses how to improve the scalability and query processing performance of RDF data management. The main research work includes the following aspects:

Firstly, an RDF data storage model based on the distributed database Cassandra is proposed. To support basic query inference, reasoning information and triples are stored separately. The triples are stored in SPO and POS tables, chosen according to SPARQL query features and the index mechanism of Cassandra. Compared with previous storage solutions that use three tables, our solution further reduces storage overhead while maintaining query performance.

Secondly, SPARQL query processing based on the Cassandra API is implemented. Processing is divided into three stages: query parsing, query optimization, and query execution. In the query optimization stage, we reorder the query patterns of the BGP (basic graph pattern) to reduce query cost, according to the number of variables in each pattern and whether patterns share variables. In the query execution stage, we design a triple pattern matching algorithm, an inference algorithm, and a SPARQL BGP query algorithm, and implement them all using Java multi-threaded programming techniques.

Thirdly, SPARQL query processing is implemented by integrating Cassandra and Spark. The drawback of using MapReduce for SPARQL query processing is that intermediate results must be written to HDFS, and the frequent I/O operations slow down SPARQL queries. In this thesis, we implement SPARQL query processing by studying the parallel mechanism of Spark.

Based on the above work, this thesis uses the LUBM test dataset to test and analyze the loading and query performance of the system we built. The experimental results show that the storage model and the SPARQL query implementation are effective.
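
As a rough illustration of the query optimization stage described in the abstract, the sketch below reorders the triple patterns of a BGP using the two criteria it names: the number of variables in a pattern and whether patterns share variables. This is not the thesis's algorithm, which the abstract does not spell out. The greedy strategy, the function names (`variables`, `choose_table`, `reorder_bgp`), and the rule for choosing between the SPO and POS tables are all assumptions made for illustration, and the thesis itself uses Java rather than Python.

```python
from typing import List, Set, Tuple

# A triple pattern is (subject, predicate, object); variables start with "?".
TriplePattern = Tuple[str, str, str]


def variables(pattern: TriplePattern) -> Set[str]:
    """Return the variable terms of a triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def choose_table(pattern: TriplePattern) -> str:
    """Pick an index table for one pattern.

    Assumption: the SPO table is keyed by subject and the POS table by
    predicate, so a bound subject favors SPO and everything else uses POS.
    """
    subject = pattern[0]
    return "POS" if subject.startswith("?") else "SPO"


def reorder_bgp(patterns: List[TriplePattern]) -> List[TriplePattern]:
    """Greedily order patterns (hypothetical heuristic).

    At each step, prefer a pattern that shares a variable with the patterns
    already chosen, then the one with the fewest still-unbound variables.
    """
    remaining = list(patterns)
    ordered: List[TriplePattern] = []
    bound: Set[str] = set()  # variables bound by patterns chosen so far

    def cost(pattern: TriplePattern) -> Tuple[int, int]:
        pattern_vars = variables(pattern)
        connected = bool(pattern_vars & bound) or not bound
        return (0 if connected else 1, len(pattern_vars - bound))

    while remaining:
        best = min(remaining, key=cost)  # ties keep the original order
        remaining.remove(best)
        ordered.append(best)
        bound |= variables(best)
    return ordered


if __name__ == "__main__":
    bgp = [
        ("?x", "ub:takesCourse", "?y"),
        ("?x", "rdf:type", "ub:GraduateStudent"),
        ("?y", "rdf:type", "ub:Course"),
    ]
    for pattern in reorder_bgp(bgp):
        print(choose_table(pattern), pattern)
```

Running the more selective patterns first and keeping each later pattern joined to already-bound variables tends to keep intermediate results small, which is the usual motivation for this kind of reordering.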

Keywords/Search Tags:

RDF Data, Distributed Storage, SPARQL Query, Cassandra, Spark

Related items

1	Research On SPARQL Query Engine Across Different Storage Platform
2	Research On Distributed Query Processing And Optimization Of RDF Data
3	Distributed Storage For Massive RDF Data Based On Graph Database
4	Design And Implementation Of Mobile Terminal Cloud Storage Based On Cassandra
5	Design And Implementation Of Storage And Analysis For WAT Data
6	Distributed Semantic Query Based On Sparql
7	Research On A Hashing Index Based RDF Data Storage And Query System And Its Application
8	Research On Distributed RDF Query Processing
9	An Ad-hoc Query Engine Based On Spark SQL
10	Research On Distributed Knowledge Management And Query Optimization Technology