Data intensive query processing for Semantic Web data using Hadoop and MapReduce

Posted on: 2012-03-14

Degree: Ph.D.

Type: Dissertation

University: The University of Texas at Dallas

Candidate: Husain, Mohammad Farhan

GTID: 1458390008498939

Subject: Computer Science

Abstract/Summary:

The Semantic Web is an emerging area that aims to augment human reasoning. Various technologies are being developed in this arena, and several have been standardized by the World Wide Web Consortium (W3C). One such standard is the Resource Description Framework (RDF). Semantic Web technologies can be utilized to build efficient and scalable systems for cloud computing. With the explosion of Semantic Web technologies, large RDF graphs are commonplace, which poses significant challenges for their storage and retrieval. Current frameworks do not scale to large RDF graphs and therefore do not address these challenges. In this dissertation, we describe a framework that we built using Hadoop, an open source distributed file system supporting the MapReduce programming paradigm, to store and retrieve large numbers of RDF triples by exploiting the cloud computing paradigm. We describe a scheme to store RDF data in the Hadoop Distributed File System (HDFS). SPARQL (SPARQL Protocol and RDF Query Language) is a language for querying RDF data. We present an algorithm that rewrites some SPARQL queries into equivalent, simpler ones by leveraging the storage scheme. More than one Hadoop job (the smallest unit of execution in Hadoop) may be needed to answer a query, because a single triple pattern in a query cannot simultaneously take part in more than one join within a single Hadoop job. To determine the jobs, we present multiple algorithms, based on greedy and exhaustive search approaches, that generate a query plan to answer a SPARQL query. We extend those algorithms to generate query plans for complex SPARQL queries with OPTIONAL blocks. We use Hadoop's MapReduce framework to answer the queries. Our results show that large RDF graphs can be stored in Hadoop clusters built with cheap commodity-class hardware. Furthermore, we show that our framework is scalable and efficient and can handle large amounts of RDF data, unlike traditional approaches.
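
The constraint that a triple pattern can take part in only one join per Hadoop job can be illustrated with a small planning sketch. The code below is not the dissertation's algorithm: the function name greedy_plan, the representation of a triple pattern as a (name, variables) pair, and the "pick the variable shared by the most patterns" heuristic are all assumptions made for illustration. It shows only why a chain-shaped query may need more than one job.

```python
from collections import defaultdict


def greedy_plan(patterns):
    """Group joins into jobs (hypothetical helper, illustration only).

    patterns: list of (name, set_of_variable_names).
    Returns a list of jobs; each job is a list of
    (join_variable, [names of the inputs joined on it]).
    Within one job, every input takes part in at most one join.
    """
    relations = [(name, frozenset(vs)) for name, vs in patterns]
    jobs = []
    while len(relations) > 1:
        used = set()          # indices already consumed by a join in this job
        job = []
        next_relations = []
        while True:
            # Index the still-unused inputs by the variables they contain.
            by_var = defaultdict(list)
            for i, (_, vs) in enumerate(relations):
                if i not in used:
                    for v in vs:
                        by_var[v].append(i)
            candidates = {v: ix for v, ix in by_var.items() if len(ix) >= 2}
            if not candidates:
                break
            # Assumed greedy rule: the variable shared by the most inputs,
            # ties broken alphabetically so the plan is deterministic.
            var = max(sorted(candidates), key=lambda v: len(candidates[v]))
            idxs = candidates[var]
            used.update(idxs)
            names = [relations[i][0] for i in idxs]
            merged = frozenset().union(*(relations[i][1] for i in idxs))
            job.append((var, names))
            next_relations.append(("(" + " JOIN ".join(names) + ")", merged))
        if not job:
            raise ValueError("no shared variable left; a cross product is needed")
        # Inputs that were not joined in this job are carried to the next one.
        next_relations.extend(
            relations[i] for i in range(len(relations)) if i not in used
        )
        relations = next_relations
        jobs.append(job)
    return jobs


# A chain-shaped query: tp2 shares ?x with tp1 and ?y with tp3,
# but it may only be used in one of those joins per job.
example = [
    ("tp1", {"x"}),
    ("tp2", {"x", "y"}),
    ("tp3", {"y", "z"}),
    ("tp4", {"z"}),
]
for number, job in enumerate(greedy_plan(example), start=1):
    print("job", number, job)
```

For this example the sketch produces two jobs: the first joins tp1 with tp2 on x and tp3 with tp4 on z, and the second joins the two intermediate results on y. An exhaustive search, as mentioned in the abstract, would instead enumerate the alternative groupings and keep the cheapest one.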

Keywords/Search Tags:

Semantic web, RDF, Data, Query, Hadoop, SPARQL, Framework

Related items

1	The Research On Structured Query Generation Framework Based On Semantic Query Graph
2	Semantic EMR Data SPARQL Query Optimization Mechanisms
3	SPARQL Federated Query And Its Application On The Semantic Web
4	Distributed Semantic Query Based On Sparql
5	Research On Distributed RDF Query Processing
6	Research On Semantic Web Service Discovery Based On Hadoop
7	Research On Linked Stream Data Query Method Based On SPARQL
8	SPARQL BGP Query Engine Based On BSP
9	A GA-Based SPARQL Static Query Optimization Method
10	Research On SPARQL Query Engine Across Different Storage Platform