The simplicity and flexibility of RDF (Resource Description Framework) model have attracted an increasing number of organizations storing their data in the RDF format. In the past decade, many researches focused on the techniques for RDF data processing in a centralized environment, such as data storage, index and query optimization. With the rapid growth of the data sizes, a single machine would not be capable to deliver satisfacto-ry query performance. The emerging need for conducting complex queries over big RDF datasets calls for scale-out solutions that can harness a computing cluster to process big RDF datasets.Comparing to the traditional centralized RDF data processing systems, in the distribut-ed RDF data processing system, RDF data would be partitioned among the computing nodes that connected by the network, in order to increase the storage and parallel computing ca-pabilities. However, the distributed data storage leads to new challenges for improving the performance of the RDF data processing. In the distributed environment, the intermediate outputs produced during query executions should be transferred cross the network. There-fore, in the most case, the communication cost is the dominant factor of the total query processing cost. Consequently, reducing the communication cost is the key challenge. In this paper, we concentrate on studying the key technologies of the distributed RDF data processing to address this challenge.First of all, in distributed SPARQL (Simple Protocol and RDF Query Language) query processing, the RDF data partitioning techniques have great impact on the network cost. Hash partitioning cannot leverage the semantic information in the RDF data, and thus induc-ing a large mount of the network cost. To address this problem, we introduce an semantic-based hybrid RDF data partitioning algorithm. Furthermore, we present a lightweight side-ways information passing technique. An extensive experimental study shows that our pro-posed techniques outperform existing techniques in terms of query response time.Queries over RDF data often involve complex distributed joins, which would be very expensive to run if the data are not carefully partitioned across the cluster and hence incurs a large mount of the network cost. To address this problem, we propose a new RDF data partitioning framework, which adopts the end-to-end path as the basic partition element. We formally formulate the balanced path partitioning problem under this new framework and proof the problem is NP-Hard. In view of the hardness of the problem, we propose an approximate algorithm that provides a performance guarantee. To enhance the efficiency, we also present a bottom-up path merging algorithm to partition the paths. We conduct an extensive experimental study using two popular RDF benchmark datasets and one real RDF dataset. The results indicate that our approach can outperform the state-of-the-art methods by orders of magnitude.Optimizing query execution plans to minimize the cost of the network cost is crucial to achieve optimal query performance. To address this issue, we define a generic RDF data partitioning model to capture the common structure of various state-of-the-art data parti-tioning methods. Then we propose an optimization algorithm that not only has an optimal complexity, but also can accommodate different data partitioning methods so that it can be applied in different application scenarios. Furthermore, we propose two new methods to reduce the search space of the optimization algorithm and an autonomous approach to choose the right method by considering the structure and the size of a complex SPARQL query. The extensive experiments show that our algorithms significantly outperform the existing ones in both algorithm efficiency and query performance. |