Font Size: a A A

Research On Query Processing Technologies Over Large Scale Knowledge Graphs

Posted on:2022-10-04Degree:DoctorType:Dissertation
Country:ChinaCandidate:X T GuoFull Text:PDF
GTID:1488306569484214Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Knowledge graph represents the real-world concepts,entities,and their relationships in a structured format.After the transformation,the huge volume of information on the Internet is easily understandable for people.Powered by semantic information and reasoning rules,knowledge graphs have been proved to be an efficient solution for all kinds of business intelligence in the internet era.The knowledge graph is one of the most significant cornerstones of artificial intelli-gence.The ubiquity uses of knowledge graphs rely heavily on competent query processing techniques.There are several challenges of the knowledge graph query processing:firstly,the sheer size of the knowledge graph makes storage and index management quite difficult in the distributed environment.Secondly,knowledge graphs are confronted with various workloads.A proper partitioning algorithm is critical to the stable query performance in such various workloads.Thirdly,knowledge graphs could be represented by the relational data model,as well as the graph data model.Consequently,the query engine should sup-port query types deriving from both models.The query processing algorithms should be designed according to the query-specific properties.Existing methods could not resolve these problems efficiently and effectively.Considering the above challenges,this thesis conducts systematic research on query processing techniques over knowledge graphs,by applying the methods in data manage-ment,algorithm design,and computation complexity theory.The main contributions are summarized as follows:1.This thesis studies the Multi-query optimization(MQO)problem in the context of knowledge graphs.MQO amounts to identify common sub-expression of a set of queries,whose results are used to assemble queries'final results.This thesis presents a new algorithm framework Leon to address the MQO problem.This thesis proposes a novel algorithm with superior pruning ability targeting the high complexity of current common sub-expression detection algorithms.It uses characteristic sets and triplets to quickly cluster queries and then discovers the high-quality common subquery in the clusters.Existing query rewriting methods merely consider the cardinality of common sub-query.However,the computation cost and memory consumption should be considered in the distributed setting.Based on the two factors,this thesis defines the rewriting cost and further defines the query rewriting problem.This thesis proposes an approximation algorithm with the guarantee of ln|Q|+1,where |Q| denotes the number of queries.The algorithm always chooses the lowest cost common sub-query to rewrite.This thesis also devises a subtle algorithm to generate query plans.The cardinality estimation based on characteristic set is more accurate than based on predicates.So the query plan is better,too.The extensive experiments on both real-world and synthetic datasets show that:for a single query,Leon is competitive with the state-of-the-art distributed systems;for the multi-query scenario,Leon outperforms 10x faster over the baseline method.2.This thesis investigates the workload-aware partitioning problem in knowledge graphs.The underlying partitioning keeps changing due to the incoming queries.In this case,the system could consistently provide good performance even in sophisticated workloads.This thesis proves the problem is NP-hard.Most workload-aware systems must know the query history in advance.This thesis proposes a workload-aware partitioning framework called WISE,which could adapt to the dynamic workload without knowing the history.This thesis proposes a new frequent query pattern mining algorithm,which encodes queries in a compact index structure called Query Span.To accelerate the construction of query span,this thesis also provides a hash-based approximation algorithm to check graph isomorphism.This thesis applies a dedicated algorithm to devise migration plans according to frequent query patterns.The algorithm decides which to move and where to move,to maximize the migration gain.Simultaneously,the load balance on each worker is preserved.Experimental results show that WISE is much more efficient than the competitors.WISE could adapt well to various workloads.3.This thesis addresses the regular path query(RPQ)problem on knowledge graphs.RPQ retrieves node pairs with the path between them satisfying some regular expression.RPQ is of great importance in financial networks,social networks,and protein interac-tion networks.This thesis presents Leon+,an algorithm framework optimized for RPQ specifically.Current RPQ processing algorithms rely heavily on auxiliary indexes,which have high computation costs and space consumption.This thesis proposes an index called CS graph,which could be computed in linear time and the size is small.Leon+takes advantage of join-ahead pruning and cardinality estimation via this CS graph.To reduce the communication cost,this thesis partitions the dataset using CS paths extracted from the CS graph.This thesis presents an efficient and precise two-staged query processing algorithm.It has a great pruning ability.Thus,the query response time is considerably reduced.The experiments show that Leon+produces great performance and shows strong scalability.4.This thesis investigates the Diversified Top-k Querying(DTQ)problem on knowl-edge graphs.Considering importance and diversity at the same time,we could provide users a more representative result set.This thesis formalizes the DTQ problem and proves it is NP-complete.The state-of-the-art method has a guarantee of 2 in O(1/2kn2)time,where n denotes the number of results.This thesis develops an approximation algorithm DTopk-Base,with a guarantee of 2 in O(kn)time.To reduce time overhead further,this thesis proposes a heuristic algorithm DTopk-Index to selectively generate high-quality results.The upper bound of DTopk-Index is 2/?,where ? ?[0,1]is a parameter to balance importance and diversity.Using both real-world and synthetic datasets,this thesis em-pirically verifies the efficiency and effectiveness of the two algorithms.The case studies show that taking both importance and diversity into consideration makes a better result set and improves users'satisfaction.
Keywords/Search Tags:Knowledge graph query processing, Multi-query optimization, Workload-aware partitioning, Regular path query, Diversified Top-k query
PDF Full Text Request
Related items