Research Of RDF Data Division And Storage Based On Hadoop

Posted on: 2014-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: J Cheng
Full Text: PDF
GTID: 2248330395995483
Subject: Computer application technology
Abstract/Summary:
The Semantic Web is an extension of the current World Wide Web. It adds machine-readable semantic information to Web content so that computers and people can work together and data can be processed automatically, thereby improving the efficiency of information retrieval. With the rapid growth of Semantic Web data, however, the storage and retrieval of RDF data face serious challenges. The MapReduce parallel framework and the HBase distributed database of the Hadoop platform meet the requirements of storing and querying massive data. This thesis studies RDF data storage and loading on the Hadoop platform. The main research work and contributions are as follows:

(1) We design an RDF data storage scheme that is based on OWL and uses HBase as the storage medium. The scheme defines multiple HTables that store RDF data according to OWL semantic information. First, the NOSClass and NOSProperty HTables save the OWL semantic information, providing a basis for reasoning and query optimization. Next, an S_PO and an O_PS HTable are created for each class defined in the OWL file to store the triples of that class. Finally, the NOSType and NOSInstance HTables store the triples whose predicate is "rdf:type".

(2) We design an efficient parallel algorithm for parsing, dividing, and loading RDF data. A MapReduce job first parses the RDF data and divides the triples according to the class to which each triple's subject belongs. The divided triple files are then converted into HFile files, which are loaded into the HBase cluster using Bulk Load. Finally, we verify the effectiveness of the proposed parallel parsing and loading algorithm.

(3) We design a hybrid SPARQL optimization algorithm based on selectivity estimation and triple pattern grouping. Triple pattern grouping first classifies the triple patterns into seven types; selectivity estimation then sorts the triple patterns within each type, yielding the optimized query execution plan. Finally, we verify the effectiveness of the hybrid method.
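To make the division step in (2) and the per-class S_PO / O_PS tables in (1) more concrete, the following is a minimal single-machine sketch in Python. It is not the thesis implementation: the function names `divide_by_subject_class` and `build_indexes`, the `"Unknown"` fallback class, and the nested-dictionary layout (row key, then predicate, then a list of values) are all assumptions made for illustration. The abstract does not give the actual HBase row-key and column design, and the real system runs the division as a MapReduce job and loads HFiles rather than building in-memory dictionaries.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def divide_by_subject_class(triples):
    """Group non-type triples by the class of their subject.

    Assumption: each subject has at most one rdf:type triple; if there
    are several, the last one seen wins. Subjects without a type fall
    into a hypothetical "Unknown" group.
    """
    # First pass: learn each subject's class from its rdf:type triple.
    subject_class = {s: o for s, p, o in triples if p == RDF_TYPE}

    by_class = defaultdict(list)
    type_triples = []  # kept apart, as the abstract stores these separately
    for s, p, o in triples:
        if p == RDF_TYPE:
            type_triples.append((s, p, o))
        else:
            by_class[subject_class.get(s, "Unknown")].append((s, p, o))
    return dict(by_class), type_triples


def build_indexes(triples):
    """Build two in-memory indexes that mimic an S_PO and an O_PS table.

    Assumed layout: S_PO is keyed by subject, O_PS by object; both map
    predicate -> list of the remaining component.
    """
    s_po = defaultdict(lambda: defaultdict(list))
    o_ps = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        s_po[s][p].append(o)
        o_ps[o][p].append(s)
    return s_po, o_ps


# Hypothetical sample data, not taken from the thesis.
sample = [
    ("ex:alice", "rdf:type", "ex:Student"),
    ("ex:alice", "ex:takesCourse", "ex:course1"),
    ("ex:bob", "rdf:type", "ex:Professor"),
    ("ex:bob", "ex:teacherOf", "ex:course1"),
]
groups, types = divide_by_subject_class(sample)
student_s_po, student_o_ps = build_indexes(groups["ex:Student"])
```

Under these assumptions, a lookup by subject goes to the S_PO index and a lookup by object goes to the O_PS index of the relevant class, so a query that knows the subject's class only has to touch that class's tables.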
Keywords/Search Tags: Semantic Web, OWL, RDF, parallel framework, selectivity estimation