
Research On Name Disambiguation And Its Application

Posted on: 2010-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: F Wang
Full Text: PDF
GTID: 2178360278462165
Subject: Computer Science and Technology
Abstract/Summary:
Name disambiguation aims to resolve the problem of multiple persons sharing the same name, also called name ambiguity. Name ambiguity is a general problem in the real world and has become one of the challenges for information integration, information retrieval, and data mining applications, especially with the fast development of the Web. In this thesis, we conduct a thorough investigation of this problem using academic data. We formally define the problem of name disambiguation in the academic social network and propose two approaches to solve it: an atomic cluster-based disambiguation approach and a constraint-based topic modeling approach.

As data points in the academic social network are usually sparse, traditional clustering algorithms often fail to achieve good performance. We propose an atomic cluster-based disambiguation approach, which consists of two stages. In the first stage, we use an extended AdaBoost algorithm to automatically detect atomic clusters, inside which points are strongly connected. In the second stage, based on the detected atomic clusters, we use different clustering methods to find the final name disambiguation result. Experimental results show that the atomic cluster method significantly outperforms traditional clustering methods: on average, +25% over k-means and +8% over hierarchical clustering.

We further study a topic modeling approach for name disambiguation. The basic idea is to map data points from the original feature space to a topic space. However, a traditional topic model cannot find a good topic distribution in the researchers' social network data. We thus propose a constraint-based topic model to overcome this limitation. We define five types of constraints according to background knowledge and incorporate the constraints into the objective function of the topic model. An adapted Gibbs sampling algorithm is employed to estimate the parameters of the model. Finally, based on the discovered topics, we use a hierarchical clustering method to find the final disambiguation results. Experiments show that the constraint-based method significantly improves the performance of name disambiguation.

We apply the proposed name disambiguation approach to a real-world academic search system, ArnetMiner. A name disambiguation module has been designed and integrated into the system.
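
The two-stage idea behind the atomic cluster approach can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The thesis detects atomic clusters with an extended AdaBoost algorithm, whereas the sketch below swaps in a fixed Jaccard-similarity threshold over coauthor sets as a stand-in. The function names, thresholds, and toy data are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def atomic_clusters(papers, strong=0.5):
    """Stage 1 (assumed stand-in): link papers whose similarity is >= strong.

    The thesis learns this step with an extended AdaBoost algorithm; a fixed
    threshold plus union-find is used here only to keep the sketch small.
    """
    parent = list(range(len(papers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(papers)), 2):
        if jaccard(papers[i], papers[j]) >= strong:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(papers)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def merge_clusters(papers, clusters, weak=0.1):
    """Stage 2: greedy average-link agglomeration over the atomic clusters."""
    clusters = [list(c) for c in clusters]

    def avg_link(c1, c2):
        total = sum(jaccard(papers[i], papers[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        score, a, b = max(
            (avg_link(clusters[a], clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if score < weak:  # no remaining pair is similar enough to merge
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(clusters)


# Toy data: each paper by an ambiguous author name, as a set of coauthors.
papers = [
    {"li", "zhang"},
    {"li", "zhang", "chen"},
    {"chen", "wu"},
    {"smith", "jones"},
    {"smith", "jones"},
]
atoms = atomic_clusters(papers)
print(atoms)                          # [[0, 1], [2], [3, 4]]
print(merge_clusters(papers, atoms))  # [[0, 1, 2], [3, 4]]
```

In this toy run, stage 1 keeps only strongly connected papers together, and stage 2 attaches the sparse paper 2 to the first group through its weaker link to paper 1. This mirrors the motivation that sparse data points are hard to cluster directly.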
Keywords/Search Tags: Name Disambiguation, Atomic Cluster, Constraint, Topic Model