Research On Chinese Person Name Disambiguation Algorithm

Posted on:2016-02-22

Degree:Master

Type:Thesis

Country:China

Candidate:C P Lin

GTID:2308330479987009

Subject:Computer technology

Abstract/Summary:

Name ambiguity is the identity uncertainty that arises when more than one person shares the same name, a phenomenon prevalent in China and abroad. In the Internet era of information explosion, people are the main actors in social activity and form a huge information network, so person search plays an important role in information retrieval. However, current search engines normally return a large set of documents containing the searched name string, which makes it inconvenient for users to find and select information about a specific entity. Person name disambiguation aims to solve this name ambiguity problem, which is an obstacle to network communication and information retrieval. The research mainly concerns how to identify different namesakes through web pages and how to present the results so that users can find what they need quickly and accurately. Person name disambiguation can also be widely applied to tracking and discovering popular figures, personalized search, question answering, and so on, which has made it a hot topic in natural language processing in recent years. Research on Chinese name disambiguation started relatively late, and given the particularities of Chinese natural language processing, many problems remain to be solved. This paper studies Chinese name disambiguation within web pages, focusing on improving text similarity and clustering methods to improve the performance of Chinese person name disambiguation.
The main contents and achievements of this paper are summarized as follows:
1. We survey person name disambiguation and summarize its fundamentals, including the basic task, the processing steps, the main challenges, and the related techniques.
2. Because the vector space model ignores the semantics and order of features, we study a text representation model based on the longest common subsequence and propose a text clustering method based on an improved longest common subsequence (LCSC). The method first represents each text as an ordered sequence of features and uses word similarity to compute the longest common feature subsequence; it then constructs the text similarity matrix based on feature weights; finally, it applies a bottom-up hierarchical clustering algorithm. Experimental results show that LCSC significantly improves overall performance in person name disambiguation compared with the traditional clustering method, raising the average F-measure from 74.2% to 84.9%; overall performance also improves by 3.7% compared with the longest common subsequence method.
3. To mitigate the big cluster phenomenon that hierarchical clustering causes in person name disambiguation, we propose a clustering method that combines the position or title property with topic information. First, positions or titles are identified across the whole document set, and documents are classified according to the differing characteristics of the named entity; then a topic set is constructed for every cluster; finally, an improved text similarity calculation based on the topic information is used for hierarchical clustering. Experimental results show that this method effectively eases the big cluster problem, and overall performance improves by almost 13% compared with the traditional hierarchical clustering method.
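
The pipeline described for LCSC (ordered feature sequences, a longest common subsequence computed with word similarity, a text similarity matrix, bottom-up clustering) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the abstract does not give the word similarity measure, the feature weighting, or the stopping rule, so the `exact_match` placeholder, the match threshold, the unweighted normalisation `2 * LCS / (len(a) + len(b))`, the average-link merging, and the `stop_threshold` parameter are all assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def exact_match(x: str, y: str) -> float:
    """Placeholder word similarity (assumption): 1.0 if identical, else 0.0.
    A semantic word similarity measure could be substituted here."""
    return 1.0 if x == y else 0.0


def soft_lcs_length(a: Sequence[str], b: Sequence[str],
                    word_sim: Callable[[str, str], float] = exact_match,
                    threshold: float = 0.8) -> int:
    """Length of the longest common subsequence, where two features
    count as a match when word_sim(x, y) >= threshold (assumed rule)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if word_sim(a[i - 1], b[j - 1]) >= threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def text_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted normalisation (assumption); the thesis uses feature weights."""
    if not a and not b:
        return 0.0
    return 2.0 * soft_lcs_length(a, b) / (len(a) + len(b))


def similarity_matrix(docs: List[Sequence[str]]) -> List[List[float]]:
    """Pairwise text similarity between all documents."""
    return [[text_similarity(x, y) for y in docs] for x in docs]


def agglomerative(sim: List[List[float]], stop_threshold: float) -> List[List[int]]:
    """Bottom-up clustering with average linkage (assumed linkage).
    Merging stops once the best pair falls below stop_threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, best_pair = -1.0, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(p, q) for p in clusters[i] for q in clusters[j]]
                avg = sum(sim[p][q] for p, q in pairs) / len(pairs)
                if avg > best:
                    best, best_pair = avg, (i, j)
        if best < stop_threshold:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

As a usage note, each web page mentioning the ambiguous name would first be reduced to an ordered list of feature words; `agglomerative(similarity_matrix(docs), stop_threshold=0.5)` then returns groups of document indices, each group standing for one hypothesised person. The threshold value 0.5 is arbitrary here and would need tuning on labelled data.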

Keywords/Search Tags:

Person name disambiguation, Text similarity, Topic set, Hierarchical clustering, Longest common subsequence

Related items

1	The Research On Algorithms For The Longest Common Subsequence Problem And Variants
2	Text Clustering Method In Topic Detection And Person Name Disambiguation
3	Parallel Algorithm For Multiple Longest Common Subsequence And Application Research On Hadoop Platform
4	Research On The Application Of Person Name Disambiguation Based On Improved Agglomerative Hierarchical Clustering
5	The Research On The Longest Common Subsequence Query Algorithm
6	Explorations on the longest common increasing subsequence problem
7	Person Name Disambiguation Based On Hierarchical Clustering And Web Page Relationship
8	Study On Parallel Algorithms For Longest Common Subsequence On Heterogeneous Cluster Computing Systems
9	Approximate Longest Common Subsequence Query Processing And Optimization On Biological Sequence
10	Graph Models And Algorithms Of The Longest Common Sub-sequences For Many Long Sequences