Research On Distributed Knowledge Management And Query Optimization Technology

Posted on: 2022-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: Z Han
Full Text: PDF
GTID: 2518306602494914
Subject: Software engineering
Abstract/Summary:
Knowledge graphs are an effective means of knowledge management, and RDF is their standard data model. As the scale of RDF data keeps growing, single-machine processing can no longer meet complex business needs. Although research on distributed RDF data management systems has produced results, some problems remain. For example, in large-scale data storage, reasonable partitioning of the data is not fully considered, resulting in low query efficiency; redundant storage models and index structures consume a large amount of storage space, which makes data management and maintenance difficult; and SPARQL queries are computationally expensive and depend excessively on cost estimation models. To address these problems, the main work of this thesis is as follows.

First, existing partitioned storage of large-scale RDF data does not fully consider the semantic associations in the data, which leads to high communication overhead during queries. To address this, an RDF data partition algorithm based on semantic information and topology is proposed. The algorithm has two stages. In the first stage, a locality-sensitive hashing algorithm based on cosine distance divides the RDF data according to vertex similarity in linear time, which speeds up partitioning. In the second stage, according to the cluster structure of the graph, local data is heuristically adjusted based on the number of adjacent vertices, which further optimizes the partition result.

Second, to address multi-table redundant storage of large-scale RDF data and the low efficiency of complex queries, a storage model of two-way adjacency linked lists based on the property graph model is proposed, and the column-oriented database HBase is used to store the data. The model is vertex-centric and compresses all attributes and edges related to a vertex into a single row in HBase. Building on this storage model, a predicate-based secondary index structure is proposed. Only one index table is used to assist queries, which preserves query efficiency while reducing storage consumption.

Third, to address the difficulty of SPARQL query optimization in a distributed environment, a join plan generation algorithm based on the secondary index is proposed for the designed HBase storage model and index structure. The triple pattern clauses are first grouped by subject, and then reordered according to the index table to generate a new join plan. To further improve execution efficiency, Spark is used to build the SPARQL query engine, and a SPARQL query mapping algorithm is proposed that submits the triple patterns in a SPARQL query as Row Key queries against HBase so that the query task executes efficiently.

Finally, the proposed methods are verified experimentally on the standard LUBM data set. The experiments cover three processes: graph partition, distributed storage, and SPARQL query. Edge-cut rate, balance factor, and time are the evaluation indicators for graph partition; data import time and storage space are the indicators for distributed storage; and query response time is the indicator for SPARQL query. The experimental results show that the proposed methods have clear performance advantages over some mainstream methods. They effectively improve the ability to manage RDF data and contribute to research in knowledge management, information retrieval, the semantic web, social networks, and other fields.
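The first partition stage described above can be illustrated with a minimal sketch. It assumes random-hyperplane hashing, a common locality-sensitive hash family for cosine similarity, and it assumes that each vertex is represented by a count vector over its outgoing predicates. The abstract specifies neither choice, and all function names here are hypothetical, so this is an illustration of the general technique rather than the thesis's algorithm.

```python
import random
from collections import defaultdict


def vertex_features(triples):
    """Build one predicate-count vector per subject (an assumed feature choice)."""
    predicates = sorted({p for _, p, _ in triples})
    index = {p: i for i, p in enumerate(predicates)}
    features = defaultdict(lambda: [0.0] * len(predicates))
    for s, p, _ in triples:
        features[s][index[p]] += 1.0
    return dict(features)


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    sig = 0
    for plane in hyperplanes:
        dot = sum(v * w for v, w in zip(vec, plane))
        sig = (sig << 1) | (1 if dot >= 0 else 0)
    return sig


def lsh_partition(triples, bits=2, seed=0):
    """Group subjects by LSH signature in a single pass over the vertices.

    Vectors with a small angle between them are likely to share a signature,
    so similar vertices tend to land in the same partition. Vertices that
    appear only as objects are not assigned in this simplified sketch.
    """
    features = vertex_features(triples)
    if not features:
        return {}
    dim = len(next(iter(features.values())))
    planes = make_hyperplanes(dim, bits, seed)
    buckets = defaultdict(list)
    for vertex, vec in features.items():
        buckets[signature(vec, planes)].append(vertex)
    return dict(buckets)
```

With `bits` hyperplanes there are at most `2 ** bits` partitions, and each vertex is hashed once, so the cost grows linearly with the number of vertices. A second stage such as the heuristic adjustment described in the abstract would then refine these buckets; it is not shown here.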
Keywords/Search Tags: Knowledge management, Distributed storage, Graph partition, SPARQL query