Data Cleaning On Probabilistic RDF Database Via Crowdsourcing

Posted on:2019-11-28

Degree:Master

Type:Thesis

Country:China

Candidate:Z Wang

GTID:2428330548454464

Subject:Computer application technology

Abstract/Summary:

With the advance of informatization, many applications, such as recommendation systems, knowledge graphs, and social networks, rely on graph databases. These applications produce uncertain data because of errors and noise introduced while data is acquired and analyzed. Uncertain data can be stored in probabilistic databases, whose query facilities return answers annotated with confidence values. However, the accumulation and propagation of uncertainty can degrade the usability of query results, so it is desirable to reduce the uncertainty in an uncertain database. RDF databases have attracted considerable attention in recent years, and RDF is widely used for knowledge representation. At present, most work on data cleaning concentrates on relational databases and schema matching, and there is no research on cleaning probabilistic RDF graph databases. Selecting edges by existing criteria such as K-path betweenness centrality does not improve query quality satisfactorily, so a new cleaning algorithm is needed. Cleaning all of the data is unrealistic for a large-scale probabilistic database; instead, only the data that can improve the quality of query results should be cleaned. This thesis addresses data cleaning on probabilistic RDF databases, with the goal of maximizing query quality improvement within a limited budget.

The advent of crowdsourcing platforms makes data cleaning more convenient: the data selected by a cleaning algorithm can be cleaned by the crowd. This thesis introduces a model of probabilistic RDF databases and analyzes how to increase the certainty of answers to RDF graph queries via crowdsourcing. The basic idea is to ask the crowd to decide whether the relationships represented by certain edges are correct. The thesis then presents three algorithms for selecting the edge that maximizes the uncertainty reduction. The naive algorithm computes the quality improvement for every effective edge. Building on it, the pruning algorithm shrinks the set of effective edges using two pruning strategies. In addition, the thesis discusses a fast optimization algorithm for queries that satisfy Pr(PHI) = 0, which only needs to compute the information gain of the edge whose probability is closest to 0.5 for each effective attribute.

Finally, the three algorithms are compared with other edge selection methods such as the WERW-Kpath algorithm. The results show that the proposed algorithms improve query quality more effectively. In terms of time efficiency, the pruning algorithm outperforms the naive algorithm, and when the query satisfies Pr(PHI) = 0, the fast optimization algorithm performs best. The solution to this problem is significant for high-quality retrieval on large-scale RDF databases.
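The selection rule mentioned for the fast optimization algorithm (for each effective attribute, consider only the edge whose probability is closest to 0.5) can be illustrated with a minimal Python sketch. The abstract does not define the thesis's quality metric or its notion of an effective attribute, so everything below is an assumption made for illustration: the `Edge` record, treating the predicate as the attribute, and using the binary Shannon entropy of a single edge as a stand-in for information gain. It is not the thesis's implementation.

```python
import math
from collections import namedtuple

# Hypothetical edge record (not from the thesis): an RDF triple plus the
# probability that the relationship it represents is correct.
Edge = namedtuple("Edge", ["subject", "predicate", "obj", "prob"])


def binary_entropy(p):
    """Shannon entropy in bits of an edge that is correct with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain edge carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def most_uncertain_edge_per_attribute(edges):
    """For each predicate, keep the edge whose probability is closest to 0.5."""
    best = {}
    for edge in edges:
        current = best.get(edge.predicate)
        if current is None or abs(edge.prob - 0.5) < abs(current.prob - 0.5):
            best[edge.predicate] = edge
    return best


def select_edge_to_clean(edges):
    """Pick one edge to send to the crowd.

    Assumption: entropy stands in for the information gain, so only one
    candidate per predicate has to be scored instead of every edge.
    """
    candidates = most_uncertain_edge_per_attribute(edges).values()
    return max(candidates, key=lambda e: binary_entropy(e.prob))


# Toy data, invented for this example.
edges = [
    Edge("alice", "knows", "bob", 0.9),
    Edge("alice", "knows", "carol", 0.55),
    Edge("bob", "worksAt", "acme", 0.2),
    Edge("carol", "worksAt", "acme", 0.7),
]

chosen = select_edge_to_clean(edges)
print(chosen.subject, chosen.predicate, chosen.obj, chosen.prob)
```

Binary entropy peaks at a probability of 0.5, which is why the edge closest to 0.5 is the most uncertain one within its attribute. Under these assumptions only one candidate per attribute is scored, rather than every effective edge as in the naive algorithm.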

Keywords/Search Tags:

probabilistic RDF database, SPARQL query, crowdsourcing, data cleaning, query quality

Related items

1	Research On The Methods Of Uncertainty Data Indexing And Querying In Mobile Environments
2	An Improved Probabilistic Database Model And Its Probabilistic Nearest Neighbors Query Research
3	Research On Duplicate Detection And Cleaning Of Uncertain Data
4	Research On SPARQL Query Engine Across Different Storage Platform
5	Study And Application On The Method Of Information Query Processing Based On The Crowdsourcing
6	SPARQL Federated Query And Its Application On The Semantic Web
7	Research On Answering Why-not Questions Over Probabilistic Reverse Top-k Queries
8	The Research On Structured Query Generation Framework Based On Semantic Query Graph
9	Uncertain Graphs Cleaning For Reachability Queries Via Crowdsourcing
10	Research On Distributed RDF Query Processing