Data Cleaning On Probabilistic RDF Database Via Crowdsourcing

Posted on: 2019-11-28
Degree: Master
Type: Thesis
Country: China
Candidate: Z Wang
GTID: 2428330548454464
Subject: Computer application technology
Abstract/Summary:
As information technology advances, many applications, such as recommendation systems, knowledge graphs, and social networks, rely on graph databases. These applications produce uncertain data because of errors and noise introduced while data is acquired and analyzed. Uncertain data can be stored in probabilistic databases, and queries over such databases return answers annotated with confidence values. However, the accumulation and propagation of uncertainty can degrade the usability of query results, so reducing the uncertainty of a probabilistic database has become an important problem.

RDF databases have attracted considerable attention in recent years, and RDF is widely used for knowledge representation. Existing work on data cleaning concentrates mainly on relational databases and schema matching, while there has been no research on cleaning probabilistic RDF graph databases. Selecting edges by general-purpose criteria such as K-path betweenness centrality does not improve query quality satisfactorily, so a new cleaning algorithm is needed. Cleaning all of the data is unrealistic for a large-scale probabilistic database; instead, only the data that can improve the quality of query results should be cleaned. This thesis addresses data cleaning on probabilistic RDF databases with the goal of maximizing query quality improvement within a limited budget.

The advent of crowdsourcing platforms makes data cleaning more convenient: the data selected by a cleaning algorithm can be cleaned by the crowd. This thesis introduces a model of probabilistic RDF databases and analyzes how to increase the certainty of answers to RDF graph queries via crowdsourcing. The basic idea is to ask the crowd to decide whether the relationships represented by certain edges are correct. The thesis then presents three algorithms for selecting the edge that maximizes the reduction in uncertainty. The naive algorithm computes the quality improvement for every effective edge. Building on it, the pruning algorithm shrinks the set of effective edges using two pruning strategies. Finally, the fast optimization algorithm applies when the query satisfies Pr(PHI) = 0; in that case it only needs to compute the information gain of the edge whose probability is closest to 0.5 for every effective attribute.

Experiments compare these three algorithms with other edge selection methods such as the WERW-Kpath algorithm and show that the proposed algorithms are more effective at improving query quality. In terms of time efficiency, the pruning algorithm outperforms the naive algorithm, and when the query satisfies Pr(PHI) = 0 the fast optimization algorithm performs best. Solving this problem is significant for high-quality retrieval on large-scale RDF databases.
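The selection rule behind the fast optimization case can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes that each edge is a (subject, predicate, object, probability) tuple, that an "effective attribute" corresponds to an edge's predicate, and that the binary entropy of an edge's probability stands in for its information gain. The names `binary_entropy` and `pick_edges_closest_to_half` and the sample edges are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of an edge that exists with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain edge carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def pick_edges_closest_to_half(edges):
    """For each predicate, keep the edge whose probability is closest to 0.5.

    Assumption: edges are (subject, predicate, object, probability) tuples
    and the predicate plays the role of the "attribute".
    """
    best = {}
    for subj, pred, obj, prob in edges:
        current = best.get(pred)
        if current is None or abs(prob - 0.5) < abs(current[3] - 0.5):
            best[pred] = (subj, pred, obj, prob)
    return best


# Hypothetical probabilistic RDF edges, for illustration only.
edges = [
    ("alice", "worksAt", "acme", 0.9),
    ("alice", "worksAt", "globex", 0.55),
    ("bob", "knows", "alice", 0.2),
    ("bob", "knows", "carol", 0.4),
]

chosen = pick_edges_closest_to_half(edges)
for pred, edge in chosen.items():
    # The chosen edge is the one the crowd would be asked to verify.
    print(pred, edge, round(binary_entropy(edge[3]), 3))
```

Binary entropy peaks at p = 0.5, so under these assumptions the edge closest to 0.5 within each attribute is the one whose verification by the crowd removes the most uncertainty, which is why only one candidate per attribute needs to be evaluated.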
Keywords/Search Tags: probabilistic RDF database, SPARQL query, crowdsourcing, data cleaning, query quality