Research And Application Of Parallel Entity Resolution Based On Hadoop

Posted on: 2015-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y S Zhang
GTID: 2268330428456501
Subject: Computer system architecture
Abstract/Summary:
Entity resolution (ER) determines whether two data records describe the same real-world entity. It is central to data integration, data cleaning, deduplication, and optimization tasks. ER is not limited to records: its basic ideas and methods also apply to problems such as text file comparison, document deduplication, facial image recognition, and fingerprint identification. Depending on who performs the resolution, ER methods fall into two main categories: machine-based (algorithmic) ER and human-based ER. Machine-based algorithms alone are highly efficient, but it is difficult for them to achieve high accuracy at the same time. Conversely, crowdsourced or human-based ER methods can achieve good accuracy, but they cannot match machine-based algorithms in resolution efficiency.

We therefore propose a hybrid approach that combines machine-based algorithms with human intelligence: hybrid human-machine entity resolution. First, similarity-based or learning-based algorithms run on the MapReduce parallel computing framework provided by the open-source Hadoop project to exclude record pairs that are unlikely to match, which reduces the number of human intelligence tasks. The remaining ambiguous record pairs are then labeled by humans.

The main work of this thesis includes: 1) a summary of ER methods and frameworks; 2) an ER method that combines crowdsourcing with machine computing; 3) a parallel MapReduce-based ER framework; 4) an application of the method and framework to the master patient index construction platform of a hospital.

The experimental results show that the proposed method makes full use of the advantages of both machine-based and human-based processing, and brings high efficiency and accuracy to patient entity resolution.
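
The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it simulates the map, shuffle, and reduce steps in plain Python rather than on Hadoop, and the blocking key (first letter of the name), the Jaccard token similarity, the thresholds `low` and `high`, and all function names are assumptions chosen for illustration. Pairs scoring below `low` are excluded, pairs at or above `high` are accepted automatically, and only the ambiguous band in between is routed to human labeling.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def map_phase(records):
    """Map step: emit (blocking_key, record).

    The key is the first letter of the name, a hypothetical blocking
    choice so that only records in the same block are compared.
    """
    for rec_id, name in records:
        yield name[:1].lower(), (rec_id, name)


def reduce_phase(block, low=0.3, high=0.8):
    """Reduce step: score every pair in one block and triage it."""
    for (id_a, name_a), (id_b, name_b) in combinations(block, 2):
        score = jaccard(name_a, name_b)
        if score >= high:
            label = "match"        # accepted by the machine
        elif score < low:
            label = "non-match"    # excluded by the machine
        else:
            label = "ask-human"    # ambiguous: becomes a human task
        yield (id_a, id_b), label


def resolve(records, low=0.3, high=0.8):
    """Run map, group by key (the shuffle), then reduce each block."""
    blocks = defaultdict(list)
    for key, rec in map_phase(records):
        blocks[key].append(rec)
    results = {}
    for block in blocks.values():
        results.update(reduce_phase(block, low, high))
    return results


if __name__ == "__main__":
    patients = [
        (1, "zhang wei"),
        (2, "zhang wei"),
        (3, "zhang wei ming"),
        (4, "li na"),
    ]
    for pair, label in sorted(resolve(patients).items()):
        print(pair, label)
```

In this sketch, widening the gap between `low` and `high` sends more pairs to human reviewers, while narrowing it leaves more decisions to the machine; that is the efficiency-versus-accuracy trade-off the abstract describes.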
Keywords/Search Tags: entity resolution, crowdsourcing, Hadoop, MapReduce, hybrid human-machine model