
Research On Entity Resolution With Crowdsourcing And Probabilistic Models

Posted on: 2018-10-23
Degree: Master
Type: Thesis
Country: China
Candidate: W L Ren
Full Text: PDF
GTID: 2428330512993972
Subject: Financial Information Engineering
Abstract/Summary:
Entity resolution is the problem of identifying and matching records that refer to the same real-world entity. Because entities are inherently ambiguous, it is a long-standing challenge in many domains, such as database management, information retrieval and machine learning, and it has become more difficult in the era of big data. The emergence of crowdsourcing platforms offers an effective way to deal with this ambiguity, so this thesis explores how to leverage the wisdom of the crowd to solve the entity resolution problem. The work proceeds in three phases.

The first phase is a literature review of the applications of entity resolution, with particular attention to crowdsourcing. The key to solving entity resolution on a crowdsourcing platform is the question generation strategy, that is, the strategy for generating Human Intelligence Tasks (HITs). This part introduces two HIT generation strategies, the correlation-based model and the probability-based model. After comparing their advantages and disadvantages, the probability-based model is chosen as the model applied in this thesis.

The second phase explores how to adapt existing HIT generation models to HIT generation for entity resolution. Unlike other HIT generation strategies, this thesis takes respondents' error rates into account and examines their potential influence on resolution efficiency and precision. Two probability-based HIT generation frameworks are proposed: the best HIT generation framework (1EPMQ) and the multiple top HIT generation framework (NEPMQ). Multiple-HIT generation is proved to be NP-hard in this phase, and an approximation algorithm and a heuristic algorithm are proposed to solve it.

The third phase runs a simulated experiment on a small dataset to evaluate the feasibility and effectiveness of the probabilistic models proposed in this thesis, and the models and algorithms are discussed in light of the experimental results. The evaluation shows that the proposed probabilistic models and algorithms are practical and robust enough for entity resolution.
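
The idea of probability-based HIT selection under respondent error can be illustrated with a small sketch. It is not the thesis's 1EPMQ or NEPMQ algorithm, whose details are not given in this abstract. It assumes that each candidate record pair has an independent match probability, that every respondent answers incorrectly with the same known error rate, and that a question's value is the Shannon mutual information between the true label and the noisy answer. All function names and the example data are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy, in bits, of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def expected_information_gain(p_match, error_rate):
    """Mutual information between the true match label and a noisy answer.

    Assumption: the respondent flips the true answer with probability
    error_rate, independently of the pair being asked about.
    """
    # Probability that the respondent answers "match".
    q_yes = p_match * (1.0 - error_rate) + (1.0 - p_match) * error_rate
    # I(T; A) = H(A) - H(A | T), and H(A | T) is the entropy of the error.
    return binary_entropy(q_yes) - binary_entropy(error_rate)


def top_questions(candidates, error_rate, n=1):
    """Return the n pairs whose answers are expected to be most informative.

    candidates maps a record pair to its current match probability.
    Pairs are scored independently, which sidesteps the joint selection
    problem the abstract describes as NP-hard.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: expected_information_gain(item[1], error_rate),
        reverse=True,
    )
    return [pair for pair, _ in ranked[:n]]


# Hypothetical candidate pairs with illustrative match probabilities.
pairs = {("r1", "r2"): 0.5, ("r1", "r3"): 0.9, ("r2", "r3"): 0.1}
best = top_questions(pairs, error_rate=0.1, n=1)
```

Under these assumptions the most uncertain pair is asked first, and a higher error rate lowers the value of every question; at an error rate of 0.5 an answer carries no information at all.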
Keywords/Search Tags: Entity Resolution, Crowdsourcing, Probabilistic Models, Shannon Entropy, HIT