A Study On Person Name Discrimination Algorithm Based On Two-Stage Clustering

Posted on: 2013-07-17
Degree: Master
Type: Thesis
Country: China
Candidate: L W Zhang
Full Text: PDF
GTID: 2298330467478187
Subject: Computer software and theory
Abstract/Summary:
With the growing popularity of the Internet, submitting queries to a search engine has become the primary way for people to obtain information online. Person name retrieval is one of the most common search tasks: a search engine makes it easy to find information about a person, but because many people share the same name, a name query often returns a long list of results that mixes several different people with that name. To find a specific person, users must either add features to refine the query or browse the result list to pick out the intended person from among the namesakes. This causes search performance to decline significantly. Therefore, it is necessary to study person name disambiguation algorithms to improve the efficiency of name retrieval.
In this paper, based on an analysis of existing name disambiguation theory and techniques, we propose a two-stage clustering method for person name disambiguation. Person attributes are important for discriminating between names. First, we extract 16 kinds of main person attributes: for the nine kinds that are relatively easy to extract, we use traditional regular-expression and dictionary matching, and for the seven kinds that are difficult to extract, we use an automated extraction method based on self-expansion. Then we represent each result returned by the search engine as an attribute vector and complete the initial clustering by calculating the similarity between documents. Because not all pages contain attribute information, pages without person attribute information cannot be clustered correctly in the initial clustering. Therefore, this paper proposes a second-stage clustering method based on semantic relations.
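The first-stage clustering described above can be illustrated with a minimal sketch. The abstract does not state which similarity measure or clustering procedure the thesis uses, so the Jaccard overlap of (attribute, value) pairs, the single-link merging, the `threshold` value, and the sample pages below are all assumptions made for illustration only.

```python
from itertools import combinations


def attribute_similarity(a, b):
    """Jaccard overlap of (attribute, value) pairs; an assumed measure."""
    pairs_a = {(k, v) for k, values in a.items() for v in values}
    pairs_b = {(k, v) for k, values in b.items() for v in values}
    if not pairs_a or not pairs_b:
        return 0.0  # a page without attributes matches nothing
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)


def initial_clustering(pages, threshold=0.3):
    """Single-link grouping: merge pages whose similarity reaches the threshold."""
    parent = list(range(len(pages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(pages)), 2):
        if attribute_similarity(pages[i], pages[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(pages)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical attribute vectors for four search results on the same name.
pages = [
    {"occupation": {"professor"}, "affiliation": {"University X"}},
    {"occupation": {"professor"}, "affiliation": {"University X"},
     "birthplace": {"City Y"}},
    {"occupation": {"actor"}},
    {},  # page with no extractable attributes
]
clusters = initial_clustering(pages)
print(clusters)
```

In this sketch the page with an empty attribute vector stays in a cluster of its own, which is the limitation that motivates the second stage.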
First, we extract the semantic relations between Wikipedia concepts and quantify them to construct a semantic relationship graph; second, we use the SimRank algorithm to calculate the similarity between any two nodes; third, the initial clustering results are represented as Wikipedia concept vectors; finally, the similarity between clusters is computed according to the semantic relations between concepts, which completes the second-stage clustering.
Experimental results show that the proposed person name disambiguation algorithm, which combines the two clustering stages, significantly improves precision and recall and performs better than previous methods. This demonstrates that the proposed algorithm is effective for the person name disambiguation problem.
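The SimRank step of the second stage can be sketched as follows. This is the basic iterative form of SimRank on an unweighted directed graph; the thesis may use a weighted or otherwise modified variant, and the `decay` value, iteration count, and the tiny concept graph are assumptions chosen for illustration.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Basic iterative SimRank.

    in_neighbors maps each node to the list of nodes that link to it.
    Every node that appears as a neighbor must also be a key.
    """
    nodes = list(in_neighbors)
    # Iteration 0: a node is fully similar to itself and to nothing else.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new_sim = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[a][b] = 1.0
                    continue
                in_a, in_b = in_neighbors[a], in_neighbors[b]
                if not in_a or not in_b:
                    new_sim[a][b] = 0.0  # no in-links, no evidence of similarity
                    continue
                # Average similarity over all pairs of in-neighbors, scaled by decay.
                total = sum(sim[i][j] for i in in_a for j in in_b)
                new_sim[a][b] = decay * total / (len(in_a) * len(in_b))
        sim = new_sim
    return sim


# Hypothetical concept graph: "cs" links to both "ml" and "dm".
graph = {"cs": [], "ml": ["cs"], "dm": ["cs"], "film": []}
sim = simrank(graph)
print(sim["ml"]["dm"], sim["ml"]["film"])
```

Two concepts score as similar when the concepts linking to them are themselves similar, so "ml" and "dm" receive a high score here while "film" stays unrelated to both.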
Keywords/Search Tags: person name disambiguation, attribute extraction, semantic relationship graph, clustering