Research On Parallel Partitioning And Distributed Processing System Of Large-scale RDF Data

Posted on: 2016-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: C F Xie
Full Text: PDF
GTID: 2348330479453380
Subject: Computer system architecture
Abstract/Summary:
Due to the flexibility and scalability of the RDF (Resource Description Framework) data model, more and more communities have released their data in RDF format. Distributed storage and processing of RDF data has therefore become a research hotspot. Although existing solutions have made progress, most of them focus on the design of distributed storage and the optimization of processing, and largely disregard balanced workload and minimal network traffic.

ParTripleBit, a hypergraph-based parallel traversal tree partitioning and distributed processing system, partitions and processes large-scale RDF data efficiently. It abstracts RDF data with a hypergraph model, then uses a traversal tree partitioning scheme to place the basic divisions onto several compute nodes in parallel, which preserves the relations between entities. To keep both the data load and the workload balanced among compute nodes, a triple placement strategy is applied. In addition, a heuristic scheme decomposes query tasks while keeping the decomposition minimal. ParTripleBit uses the asynchronous, non-blocking communication model provided by MPI, together with a block-level variable-length integer delta compression scheme and a parallel pipeline during interaction. A lock-free work-stealing scheduler schedules the query tasks. When intermediate results are collected, a batch merge operation reduces the number of comparisons between keys.

ParTripleBit performs well when compared with five state-of-the-art RDF engines: two centralized engines, TripleBit and RDF-3X, and three distributed engines, unone-on, dirtwo, and untwo-on. In partitioning, ParTripleBit cuts preprocessing time by a factor of several and offers the minimum redundancy and the best data load balance. In query processing, ParTripleBit achieves a 40% performance improvement over the three distributed engines, and an improvement of several times to tens of times over the two centralized engines. In scalability, query processing performance improves linearly or superlinearly as the number of compute nodes increases, and execution time grows sublinearly as the data size increases. ParTripleBit therefore scales well.
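
The block-level variable-length integer delta compression mentioned in the abstract is not specified in detail here. As a rough illustration of the general idea only, the following Python sketch assumes that a block holds non-negative integer IDs sorted in ascending order, stores each ID as the difference from its predecessor, and writes each difference as a 7-bit varint (LEB128-style). The function names `encode_varint`, `compress_block`, and `decompress_block` are hypothetical and are not taken from ParTripleBit.

```python
from typing import List


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte.

    The high bit of each byte is set when more bytes follow.
    """
    if value < 0:
        raise ValueError("varint encoding requires a non-negative integer")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_block(sorted_ids: List[int]) -> bytes:
    """Delta-encode an ascending list of IDs, then varint-encode each delta.

    Assumption: IDs are non-negative and sorted, so every delta is >= 0.
    """
    out = bytearray()
    prev = 0
    for current in sorted_ids:
        if current < prev:
            raise ValueError("IDs must be sorted in ascending order")
        out += encode_varint(current - prev)
        prev = current
    return bytes(out)


def decompress_block(data: bytes) -> List[int]:
    """Invert compress_block: decode varints and accumulate the deltas."""
    ids: List[int] = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to this delta
        else:
            prev += value  # delta complete: rebuild the original ID
            ids.append(prev)
            value = 0
            shift = 0
    return ids
```

Under these assumptions, small gaps between consecutive IDs cost a single byte each, which is why sorting before compression matters; a real engine would also need block headers and a policy for block boundaries, which this sketch omits.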
Keywords/Search Tags: Hypergraph Model, Traversal Tree Partitioning, Distributed Processing, Asynchronous Communication