
Research And Implementation Of Large Collections Of RDF Data Distributed Storage On Domain Ontology

Posted on: 2021-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Qiao
GTID: 2428330623968568
Subject: Engineering
Abstract/Summary:
With the rapid development of big data and semantic ontology technologies, the scale of domain-specific RDF data has grown continuously. Domain knowledge graphs are now built from RDF data on the order of 10 billion triples or more, for example Alibaba's core commodity knowledge graph and the British Museum's collection knowledge graph, so storing and querying RDF data based on a domain ontology in a reasonable way has become a focus of current research. For large-scale RDF data sets, storage methods based on relational databases scale poorly and cannot take advantage of distributed systems. Some existing distributed storage methods improve query efficiency only at the cost of additional storage space. In addition, they do not exploit the semantics of the domain ontology when querying and cannot perform ontology reasoning to improve query efficiency. To address these shortcomings of existing RDF storage and query schemes, this thesis proposes a distributed storage scheme for large-scale RDF data based on HBase and oriented to domain ontologies, then proposes a Hive-based query scheme for that storage scheme. Experimental results verify the feasibility and advantages of the approach. The work consists of the following parts:

(1) Design and implementation of a distributed RDF data storage scheme based on HBase. The scheme first analyzes the domain ontology and stores the relationships between classes in an HBase table. With this table, queries can make use of the domain ontology, which improves the recall rate. Then, based on the characteristics of the SPARQL query statements in the standard test set, the RDF data table in HBase is designed so as to reduce the number of self-joins and increase query speed. In addition, a filter is added to each HFile to speed up data reads during querying.

(2) A Hive-based query scheme corresponding to the storage scheme above. The query scheme mainly consists of formulating algebraic transformation rules for SPARQL operators, constructing a unique subject mapping table and completing it with the class relation table, establishing a Hive view, constructing and optimizing the abstract syntax tree of the query, generating HiveQL, and executing the query as a MapReduce job. By translating SPARQL into HiveQL, the query scheme gains scalability and fault tolerance, and it introduces ontology inference into the query. Optimizing the syntax tree further improves query efficiency.

(3) Implementation of the proposed storage and query schemes and design of experiments to verify them. Benchmark test sets of various sizes are used to measure data load time, query time, recall rate, and other metrics. The results verify that the storage and query scheme, combined with ontology inference, performs well in a distributed environment.
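
The role of the class relation table in the query step can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes a hypothetical list of (subclass, superclass) pairs standing in for the HBase class relation table, a hypothetical Hive table `rdf_triples` with columns `subject`, `predicate`, and `object`, and invented class names. The sketch expands a single `?x rdf:type <Class>` pattern to cover all subclasses before emitting a HiveQL string, which is one simple way ontology inference can raise recall.

```python
from collections import defaultdict

# Hypothetical (subclass, superclass) rows standing in for the class
# relation table; the class names are invented for illustration.
SUBCLASS_OF = [
    ("GraduateStudent", "Student"),
    ("UndergraduateStudent", "Student"),
    ("Student", "Person"),
    ("Professor", "Person"),
]


def descendants(cls, pairs):
    """Return cls together with every direct or indirect subclass of cls."""
    children = defaultdict(set)
    for sub, sup in pairs:
        children[sup].add(sub)
    seen = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def type_query_to_hiveql(cls, pairs, table="rdf_triples"):
    """Build a HiveQL string for the pattern `?x rdf:type <cls>`.

    The table name and the column names (subject, predicate, object) are
    assumptions made for this sketch, not the schema used in the thesis.
    """
    classes = sorted(descendants(cls, pairs))
    in_list = ", ".join("'" + c + "'" for c in classes)
    return (
        f"SELECT subject FROM {table} "
        f"WHERE predicate = 'rdf:type' AND object IN ({in_list})"
    )


print(type_query_to_hiveql("Student", SUBCLASS_OF))
```

Without the expansion, a query for `Student` would match only subjects typed exactly as `Student`. With it, subjects typed as any subclass are returned as well, at the cost of a longer `IN` list.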
Keywords/Search Tags: domain ontology, RDF storage, HBase, MapReduce, Hive