Font Size: a A A

Semantic Search On Large Scale RDF Data

Posted on:2014-04-05Degree:DoctorType:Dissertation
Country:ChinaCandidate:H F WangFull Text:PDF
GTID:1268330422988721Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Semantic Web is to add explicit semantics for data so that machines can notonly display these data, but also understand, process and even integrate them. Inrecent years, with the fast development of the Linking Open Data (LOD) projectand the DBpedia project, Semantic Web data sources increase significantly anda large number of graph data in form of RDF are published. The Web is rapidlychanging from the one containing Web pages and hyper-links only (called DocumentWeb) to a Data Web with abundant entities and rich relationships between entities.Large search engine companies like Google are building knowledge graph based onthese semantic data to improve the quality of search, which indicates that we enterthe era of semantic search.Diferent from the traditional document retrieval, semantic search has to dealwith finer-grained structured semantic data, which is a grand challenge. The well-optimized storage and indexing techniques specified for Web documents are nolonger suitable for RDF data. The state of the art ranking algorithms cannotbe directly applied to the semantic search scenario focusing on entities and theirassociations. How to support SPARQL query processing and how to handle dataintegration from heterogenous data sources are totally unexplored problems. Fur-thermore, how to allow end users to express their information needs using keywordqueries are crucial to the wide adoption of semantic search.This thesis aims at systematically tackling the following challenges raised bysemantic search on large scale RDF data: scalable graph data storage and indexing,efcient query processing with both graph constraints and keywords, efective entity-oriented structural ranking, heterogenous data integration from multiple sources,and friendly user interaction and search interface. The contents and the maincontributions of each chapter are listed as follows.Chapter1is the overview of the thesis. It summarizes the research literatureof semantic search and describes the main challenges of semantic search on large scale RDF data in details.Chapter2is the first work to use an information retrieval approach to searchthe Data Web. By using and extending inverted indices, efcient query process-ing for unary tree-shaped hybrid queries is supported. In addition, we propose arelation-based ranking algorithm to return relevant entities, use faceted browsing toallow users interactively construct hybrid queries, and design a block-based indexextension to support incremental index update for handling data changes.Chapter3extends the expressivity of the structured query language introducedin Chapter2, and introduces our work on building an efcient RDF query engineto execute more general SPARQL queries. Moreover, we collect some given RDFdata statistics to estimate the cost of a query plan. Then we design a novel queryoptimization algorithm to determine the optimized join order, and transform thegiven SPARQL query to the best query plan accordingly.Chapter4discusses the efcient query processing based on RDF graph patterns.This chapter introduces two pattern selection mechanisms. One is based on someheuristic rules to select frequent RDF subgraphs while the other makes use of queryhistories to select user preferred graph substructures. Beyond the index structureintroduced by the above two chapters, we design an efcient index for RDF patternsinstead of triples. We further represent a query plan by the corresponding patterntree and apply sub-pattern covering algorithm to answer SPARQL queries.Chapter5proposes a two-stage solution to tackle instance matching issues inthe scenario of semantic search over large scale RDF graph data. We use blocking toquickly filter out candidate instance pairs so as to meet the scalability requirement.We then leverage the local structural characteristics of entities to perform clusteringin each block, and get the final matches. This is the first work to leverage exist-ing sameAs triples from Linking Open Data to assess the efectiveness of instancematching in a large scale.Chapter6studies a novel and user-friendly interaction for keyword search.That is to perform keyword-based query interpretation efectively on large scalegraph data, especially RDF data. We propose a novel top-k subgraph search algo-rithm to transform keyword queries to the corresponding structured query insteadof calculating answers directly. We then leverage summarization techniques to gen-erate a summary graph containing necessary graph schema information for further query interpretation speedup.Chapter7introduces a pay-as-you-go Data Web search infrastructure. Thischapter extends query interpretation to heterogenous Web data sources. That is, wetranslate the user keyword query into a semantic query spanning multiple sources.Moreover, we introduce detailed algorithms of distributed query processing overData Web. In particular, we design a special join algorithm called map join. It usesthe large scale instance matching algorithm introduced in Chapter5to pre-computedata-level mappings, and merge merge results derived from heterogenous sources.Chapter8extends the semantic search scenario to a hybrid Web environmentcontaining graph structured data, Web pages, as well as semantic annotations onthese pages. By integrating information retrieval technologies with database tech-nologies, we build a scalable data repository which can handle a large amount ofdocuments, graph data and semantic annotations. Furthermore, we propose a noveldata structure to represent (intermediate) results returned by hybrid search, anddesign a series of efcient algorithms for answering hybrid queries.Chapter9summarizes the research results and main achievements of this thesisand lists several future work of semantic search.
Keywords/Search Tags:Semantic Search, Hybrid Querying, Graph Data Indexing, QueryOptimization, Instance Matching, Query Interpretation, Ranking
PDF Full Text Request
Related items