AGA-Based SPARQL Static Query Optimization MethodWith more and more Semantic Web applications are developed, research issues about efficiently storing and querying of RDF data in Semantic Web are becoming more urgent. In the past ten years, many RDF storage tools have been developed, among which Jena, Sesame and RDF-3X have been widely used both in the research and industry fields. They all provided SPARQL query models. At the same time, optimization of SPARQL query execution was also considered.The core component of SPARQL language is the basic graph pattern, which can be considered as conjunctive query composed of a series of triple patterns. A solution of a query is generated by join operation of the result set corresponding to each triple pattern in the query. With this consideration, a SPARQL basic graph pattern can be easily taken as a query expression with multiple join operators. Meanwhile, the native triple structure of RDF triples in the form of<subject, property, object> leads to SPARQL queries on RDF data with a large number of joins, thus the possible join order of these join operations is much more complicated, which makes the query optimization problem in semantic web context more considerable. As different ordering of triple patterns in a BGP will have quite different execution costs, which introduces SPARQL static query optimization problem for the optimal join ordering of triple patterns.In this paper, we outline the problem of multi-join ordering query optimization in the semantic web scenario, also known as SPARQL static query optimization. Many works have been done in this field, including greedy heuristic optimization methods and dynamic programming methods. As for heuristic methods, optimization algorithm only explores limited left linear plans, thus can’t get optimal solution in many cases for this incomplete solution space. Some other work considered dynamic programming based algorithm for query plans enumeration, though more bushy tree search space were exploited, it can easily lead to high optimization cost which was unaccepted in real-time query cases.Unlike most previous researches, we concentrate on more general SPARQL query forms and devote ourselves to the problem of searching for the optimal query plan within a more complete search space-bushy plan space. We model the BGP reordering optimization as a genetic evolution problem and implement a genetic algorithm on the open-source Jena ARQ System.Motivated by the observation that clique structures occurred in the SPARQL query graph frequently, and with the analysis of genetic algorithms for the optimization of complicated problems, we proposed a SPARQL BGP query optimization method based on delicate design of a genetic algorithm. Our optimization method searches for the optimal query plan in the bushy tree space, which is more integrated than the usual left-linear tree space. As a result, better optimal solution can be generated for more complicated query patterns. To largely exploit the efficiency of genetic algorithm, some main issues are carefully considered, such as designing of chromosome encoding, selection and genetic operators (crossover operator and mutation operator). A fitness function is also developed to calculate the fitness of individuals in each generation.As for the experimental part, LUBM, a widely used benchmark, is applied. We use the auto RDF data generator wrapped with the LUBM toolkit to generate RDF datasets of different scale. Based on the LUBM standard queries, some more complicated queries are created manually. Altogether, two experiments with comparative analysis are designed to exploit the performance of our GA-based SPARQL static query optimization method. As two comparative methods, Jena ARQ with a heuristic SPARQL query optimization algorithm and RDF-3X with dynamic programming method are considered. Our experimental results reveal that, with carefully setting of genetic algorithm parameters, our method can lead to a comparable query response performance with considerable optimization time. |