Scientific Publications Author Name Disambiguation And Entity Linking

Posted on: 2013-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: W Q Song
Full Text: PDF
GTID: 2268330392967759
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet, author-oriented online literature systems such as DBLP and Microsoft Academic Search are becoming increasingly popular. Meanwhile, distinguishing authors who share the same name has become an urgent issue in these systems. This thesis divides the author recognition problem into two sub-problems: name disambiguation and entity linking. For name disambiguation, two methods are employed, one based on clustering algorithms and one based on a feature graph. For the entity linking problem for teachers, the thesis applies different linking methods to Chinese and English publications. The main content of the thesis consists of the following four parts.

(1) The establishment of a publication test data set and the feature selection method: First, this thesis proposes a method to collect and filter publications and presents a general method to evaluate the performance of name disambiguation algorithms. Second, because different publication features affect disambiguation differently, the thesis designs a framework to test them. The results show that the co-author, journal and keyword features are the most effective for name disambiguation, while the abstract and title features are relatively weak. Understanding the properties of all of these features is important for improving name disambiguation algorithms and for boosting the accuracy of entity linking.

(2) The name disambiguation algorithm based on clustering: The general idea is to transform the disambiguation problem into a clustering problem. This thesis makes use of three typical clustering algorithms, namely hierarchical clustering, K-means and Affinity Propagation. Based on a comparative analysis of the pros and cons of these three methods, the thesis proposes a two-step clustering algorithm. Experimental results show that this method outperforms the general clustering algorithms, with precision and recall rising to 90% and 75%, respectively.

(3) The name disambiguation algorithm based on the feature relationship graph: This thesis introduces a feature relationship graph into the name disambiguation problem and proposes two methods. One is a hierarchical clustering algorithm based on graph properties, and the other is an algorithm based on connected subgraphs. Experimental results show that both methods outperform the clustering methods and that the algorithm based on connected subgraphs achieves the best performance, with precision and recall rising to 94.5% and 84%, respectively.

(4) Entity linking and the Tnet system: Entity linking mainly solves the problem of linking the author of a publication to the teacher who is the same person as that author. This thesis applies different linking methods to Chinese and English publications. Finally, the research results are applied to the publication presentation module of the Tnet system.
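As an illustration only of the connected-subgraph idea in part (3), the following Python sketch treats each publication carrying the same ambiguous author name as a node, joins two nodes when they share enough features, and reads each connected component as one real author. The record fields (`coauthors`, `venue`, `keywords`), the function names and the `min_shared` threshold are assumptions made for this example; the abstract does not say how the thesis builds its graph or weights its features.

```python
from collections import defaultdict
from itertools import combinations


def shared_feature_count(a, b):
    """Count the co-authors, venue and keywords that two records share."""
    return (
        len(a["coauthors"] & b["coauthors"])
        + int(a["venue"] == b["venue"])
        + len(a["keywords"] & b["keywords"])
    )


def disambiguate(pubs, min_shared=1):
    """Group publication indices into connected components.

    Two publications are linked when they share at least `min_shared`
    features (a hypothetical threshold, not taken from the thesis).
    """
    parent = list(range(len(pubs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(pubs)), 2):
        if shared_feature_count(pubs[i], pubs[j]) >= min_shared:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(pubs)):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Made-up records that all carry the same ambiguous author name.
pubs = [
    {"coauthors": {"L. Chen"}, "venue": "Journal A", "keywords": {"clustering"}},
    {"coauthors": {"L. Chen", "M. Liu"}, "venue": "Journal B", "keywords": {"graphs"}},
    {"coauthors": {"M. Liu"}, "venue": "Journal C", "keywords": {"entity linking"}},
    {"coauthors": {"K. Zhao"}, "venue": "Journal D", "keywords": {"protein folding"}},
]
clusters = disambiguate(pubs)
print(clusters)
```

In this toy data the first three records end up in one group because they are chained through shared co-authors, even though the first and third share nothing directly, while the fourth record stays on its own. That transitive linking is what distinguishes a connected-component grouping from a purely pairwise similarity test.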
Keywords/Search Tags: author recognition, name disambiguation, clustering algorithm, entity linking