Entity Resolution Based On Block Dependency In Big Data

Posted on: 2016-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: M Huang
Full Text: PDF
GTID: 2298330467472462
Subject: Computer Science and Technology
Abstract/Summary:
Entity resolution (ER) is the problem of identifying and linking or grouping different manifestations of the same real-world object. It is widely used in database management, machine learning, and information retrieval. Traditional ER focuses on the accuracy of results on small data sets. In the age of big data, however, its time complexity prevents traditional ER from handling massive data sets. We therefore need more efficient distributed techniques to meet the new challenges that massive data sets raise.

An algorithm for parallel entity resolution based on block dependency is proposed for the big data environment. It consists of three stages under the MapReduce programming framework. First, blocking reduces the amount of computation by partitioning entities according to a blocking criterion. Second, entities with low dependency on the block they belong to are picked out and matched against entities in other blocks. This filtering strategy preserves the accuracy of the resolution results while reducing the amount of computation to some degree. Third, a span distance is set to limit the number of comparisons and further improve efficiency. In addition, a load balancing strategy is designed and implemented for the blocking stage and the span distance stage; it distributes the computation evenly and improves efficiency further.

An evaluation on Hadoop using a real data set shows that the algorithm is efficient and effective.
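
The three stages described above can be illustrated with a small single-process Python sketch. It is not the thesis's implementation: the abstract does not define block dependency or span distance precisely, so the sketch assumes that dependency is an entity's mean string similarity to the other members of its block, and that span distance is the maximum index difference between blocks in sorted key order. The blocking key (first letter of the name), the thresholds, and every function name are hypothetical. MapReduce distribution and load balancing are not modeled.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """String similarity in [0, 1] (a stand-in for any matching function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_blocks(records, key_len=1):
    """Stage 1 (assumed key): group records by the first letters of the name."""
    blocks = defaultdict(list)
    for rec_id, name in records:
        blocks[name[:key_len].lower()].append((rec_id, name))
    return dict(sorted(blocks.items()))  # sorted so blocks have a position


def dependency(index, block):
    """Assumed definition: mean similarity to the other members of the block."""
    name = block[index][1]
    others = [n for i, (_, n) in enumerate(block) if i != index]
    if not others:
        return 0.0  # a singleton has nothing tying it to its block
    return sum(similarity(name, o) for o in others) / len(others)


def resolve(records, dep_threshold=0.5, match_threshold=0.8, span=1):
    blocks = build_blocks(records)
    keys = list(blocks)
    matches = set()

    # Compare all pairs inside each block.
    for key in keys:
        for (id_a, a), (id_b, b) in combinations(blocks[key], 2):
            if similarity(a, b) >= match_threshold:
                matches.add(tuple(sorted((id_a, id_b))))

    # Stages 2 and 3: only low-dependency entities leave their block,
    # and only to blocks within `span` positions of their own.
    seen = set()
    for i, key in enumerate(keys):
        block = blocks[key]
        for idx, (rec_id, name) in enumerate(block):
            if dependency(idx, block) >= dep_threshold:
                continue
            for j in range(max(0, i - span), min(len(keys), i + span + 1)):
                if j == i:
                    continue
                for other_id, other_name in blocks[keys[j]]:
                    pair = tuple(sorted((rec_id, other_id)))
                    if pair in seen:
                        continue
                    seen.add(pair)
                    if similarity(name, other_name) >= match_threshold:
                        matches.add(pair)
    return matches


records = [
    (1, "John Smith"),
    (2, "Jon Smith"),
    (3, "Katherine Lee"),
    (4, "Catherine Lee"),
]
print(resolve(records, span=1))
print(resolve(records, span=2))
```

In this toy data the two Smith records share a block and depend strongly on it, so they are never compared outside it. The two Lee records sit alone in blocks "c" and "k", which are two positions apart, so they are only compared when the span is at least 2. This shows the trade-off the span distance controls: a smaller span means fewer comparisons but may miss matches across distant blocks.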
Keywords/Search Tags: Entity resolution, Big data, Block dependency, Blocking, Data filtering