Research On Storage And Query Of Massive RDF Data

Posted on: 2017-11-13  Degree: Master  Type: Thesis
Country: China  Candidate: L S Yu  Full Text: PDF
GTID: 2428330566453014  Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Semantic Web technology and its wide application in various fields, RDF datasets are growing ever larger. Traditional centralized RDF storage systems are facing a bottleneck in the scalability and performance of query processing. How to build a distributed RDF storage system that is highly scalable and answers SPARQL queries efficiently has become a hot topic in the Semantic Web field. However, existing distributed RDF storage systems still have shortcomings. For example, systems based on HDFS perform poorly when querying RDF data, because HDFS has no efficient index and processing SPARQL queries with MapReduce incurs high latency. To address these problems, we propose to store RDF data in Cassandra and to implement SPARQL queries with the Spark parallel computing framework. This thesis addresses how to improve the scalability and query-processing performance of RDF data management. The main research work includes the following aspects:

First, an RDF data storage model based on the distributed database Cassandra is proposed. To support basic query inference, reasoning information and triples are stored separately. The triples are stored in SPO and POS tables, designed according to SPARQL query features and the index mechanism of Cassandra. Compared with previous storage solutions that use three tables, our solution further reduces storage overhead while preserving query performance.

Second, SPARQL query processing based on the Cassandra API is implemented. Query processing is divided into three stages: SPARQL query parsing, query optimization, and SPARQL query execution. In the query optimization stage, we reorder the triple patterns of the BGP (basic graph pattern) to reduce query cost, according to the number of variables in each pattern and whether patterns share variables. In the query execution stage, we design a triple pattern matching algorithm, an inference algorithm, and a SPARQL BGP query algorithm, and implement all of them with Java multi-threaded programming techniques.

Third, SPARQL query processing is implemented by integrating Cassandra and Spark. The drawback of implementing SPARQL query processing with MapReduce is that intermediate results must be written to HDFS, and the frequent I/O operations slow down SPARQL queries. In this thesis, we implement SPARQL query processing based on a study of Spark's parallel mechanism.

Building on this work, the thesis uses the LUBM test dataset to test and analyze the loading and query performance of the system we built. The experimental results show that the storage model and the SPARQL query implementation are effective.
Keywords/Search Tags:RDF Data, Distributed Storage, SPARQL Query, Cassandra, Spark
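
The abstract describes the BGP reordering step only in outline, so the following Python sketch is an illustration under stated assumptions, not the thesis's actual algorithm (which the abstract says was written in Java). It assumes that variables are written with a leading "?", that patterns with fewer variables should run first, and that later patterns sharing a variable with an already-ordered pattern are preferred. The function names (reorder_bgp, choose_table) and the table-selection rule are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def is_var(term: str) -> bool:
    """Assumed convention: SPARQL variables start with '?'."""
    return term.startswith("?")


def variables(pattern: Triple) -> Set[str]:
    """Return the set of variables that occur in a triple pattern."""
    return {term for term in pattern if is_var(term)}


def reorder_bgp(patterns: List[Triple]) -> List[Triple]:
    """Greedy reordering (assumed heuristic, not taken from the thesis).

    The first pattern is the one with the fewest variables. After that,
    patterns that share a variable with those already ordered come first,
    and ties are broken by the number of variables.
    """
    remaining = list(patterns)
    ordered: List[Triple] = []
    bound: Set[str] = set()  # variables bound by patterns ordered so far

    while remaining:
        def cost(p: Triple) -> Tuple[int, int]:
            vs = variables(p)
            shares = bool(vs & bound)
            # Sharing is irrelevant for the first pattern (nothing bound yet).
            return (0 if shares or not ordered else 1, len(vs))

        best = min(remaining, key=cost)  # first minimum wins on ties
        remaining.remove(best)
        ordered.append(best)
        bound |= variables(best)
    return ordered


def choose_table(pattern: Triple) -> str:
    """Hypothetical table choice for the two-table (SPO, POS) layout.

    Assumption: SPO is keyed by subject and POS by predicate. Patterns
    with neither bound fall back to a scan of SPO.
    """
    s, p, _ = pattern
    if not is_var(s):
        return "SPO"
    if not is_var(p):
        return "POS"
    return "SPO"


if __name__ == "__main__":
    bgp = [
        ("?x", "ub:takesCourse", "?y"),
        ("?x", "rdf:type", "ub:Student"),
        ("?y", "rdf:type", "ub:Course"),
    ]
    for pattern in reorder_bgp(bgp):
        print(choose_table(pattern), pattern)
```

In this sketch, a selective pattern with a single variable is evaluated first, and each following pattern joins on a variable that is already bound, which avoids building Cartesian products between unrelated patterns. A real optimizer would likely also use statistics on the stored triples, which the abstract does not mention.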