
Research On Hybrid Human-Machine Based Entity Resolution Methods

Posted on: 2020-05-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S S Gong
Full Text: PDF
GTID: 1368330578963124
Subject: Computer Science and Technology
Abstract/Summary:
With the development of Semantic Web technology, and especially the promotion of the Linking Open Data project, a large amount of linked data has been published on the Web, gradually forming a Web of data. These data describe a great variety of entities from different domains. Due to the open and decentralized nature of the Web, real-world objects are usually described in multiple data sources, and these descriptions overlap and complement one another. The task of entity resolution is to identify entities that refer to the same real-world object. It plays an important role in Semantic Web applications such as data integration, search and browsing. For Semantic Web data, automatic machine-based entity resolution methods have achieved considerable results. However, due to the heterogeneity, large scale and uneven quality of Semantic Web data, existing machine-based methods are far from perfect and require further improvement.

In recent years, hybrid human-machine entity resolution methods have attracted extensive attention. They introduce human knowledge to supplement machine resolution, reduce possible errors and improve resolution performance. However, this kind of hybrid resolution also brings new problems and challenges. The first issue is how to effectively combine human intelligence and machine processing, making full use of their complementary strengths to achieve better results. Second, human feedback is prone to errors and noise, so quality control is another problem to be solved in human-involved entity resolution. In addition, an important issue tied to quality control is cost optimization: increasing the amount of human participation can yield higher accuracy, but at a higher cost. To address these problems, the main work of this dissertation is as follows:

· Propose an entity resolution method based on distributed human computation and consensus partition. The method first leverages distributed human computation to identify a subset of coreferent entities, and then performs large-scale entity resolution with machine learning algorithms. To solve the quality control problem, consensus partition is used to aggregate all user-judged resolution results and resolve their disagreements. To reduce user involvement, ensemble learning is performed on the consensus partition to automatically identify coreferent entities that users have not judged. The method is integrated into an online Linked Data browsing system. Driven by the incentive of data browsing, users can participate in entity resolution as part of their daily browsing activities. The experimental results show that the method substantially improves the accuracy of user-judged resolution results, and reduces user involvement by automatically identifying a large number of coreferent entities.

· Propose an entity resolution method based on diverse user expertise across different topics. This method leverages distributed human computation for entity resolution. Using text analysis, the multiple topics of each entity resolution task are first identified, and user expertise is modeled on these topics, so that the quality of user-judged results can be further improved. Meanwhile, to address the data sparsity problem, similar-task clustering is used to enhance topic modeling across similar tasks. Finally, the method performs user expertise estimation, similar-task clustering and task result inference simultaneously in a unified model. The experimental results show that the method obtains resolution results with higher accuracy using fewer users, and that its estimated expertise is more consistent with users' real expertise.

· Propose an entity resolution method based on deep reinforcement learning. This method clusters coreferent entities together and leverages human feedback to optimize the clustering process and improve resolution performance. To effectively combine human decision-making with machine processing, the clustering process is formalized as a reinforcement learning problem. First, the method uses a neural network to learn representations of entity pairs from property-value information and generate the feature vectors for clustering. Then, a policy network decides which clusters to merge at each step of clustering. The method uses human feedback to produce cumulative rewards, and optimizes feature vector learning and entity clustering from a global perspective. The experimental results show that this method outperforms the state-of-the-art methods in terms of accuracy on different datasets.

· Propose a property clustering framework for entity resolution. Property clustering finds related or matched properties, and is a basic component that many entity resolution methods rely on. The proposed framework uses human feedback on property clustering results to identify topic-related properties more accurately, so that feature vectors for entity resolution based on property information can be extracted more effectively. The framework introduces 13 different measures that investigate property relatedness from different perspectives, and seven clustering algorithms with different characteristics. To combine the various relatedness measures and clustering results, the framework improves property clustering performance with two combination methods: linear combination of measures and consensus clustering. The experimental results show that different measures and clustering algorithms have different preferences in terms of precision and recall of property clustering, and that a proper combination of measures and clustering algorithms gives the best clustering result.
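
As an illustration of the quality-control idea in the first contribution, the sketch below aggregates noisy pairwise user judgments into a single partition of entities. It is a minimal sketch under stated assumptions, not the dissertation's algorithm: simple majority voting followed by transitive closure (union-find) stands in for consensus partition, and the function name `consensus_partition`, the tuple layout of `judgments`, and the `threshold` parameter are all hypothetical.

```python
from collections import defaultdict


def consensus_partition(judgments, threshold=0.5):
    """Aggregate pairwise coreference judgments into clusters.

    judgments: iterable of (user, entity_a, entity_b, is_same) tuples
               (hypothetical layout, chosen for this illustration).
    threshold: fraction of "same" votes a pair must exceed to be merged.
    """
    # pair -> [number of "same" votes, total number of votes]
    votes = defaultdict(lambda: [0, 0])
    entities = set()
    for _user, a, b, is_same in judgments:
        pair = tuple(sorted((a, b)))  # treat (a, b) and (b, a) as one pair
        votes[pair][0] += 1 if is_same else 0
        votes[pair][1] += 1
        entities.update(pair)

    # Union-find over entities; every entity starts in its own cluster.
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the two clusters of every pair that wins the majority vote.
    for (a, b), (yes, total) in votes.items():
        if yes / total > threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for e in entities:
        clusters[find(e)].add(e)
    return sorted(sorted(c) for c in clusters.values())


example = [
    ("u1", "a", "b", True),
    ("u2", "a", "b", True),
    ("u3", "a", "b", False),  # a dissenting (noisy) judgment
    ("u1", "b", "c", True),
    ("u1", "c", "d", False),
    ("u2", "c", "d", False),
]
print(consensus_partition(example))  # [['a', 'b', 'c'], ['d']]
```

Majority voting treats all users as equally reliable. The expertise modeling described in the second contribution would instead weight users by topic, which this sketch does not attempt.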
Keywords/Search Tags: Entity Resolution, Distributed Human Computation, User Modeling, Reinforcement Learning, Property Clustering