
Research On The Application Of Person Name Disambiguation Based On Improved Agglomerative Hierarchical Clustering

Posted on: 2021-05-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Liu
Full Text: PDF
GTID: 2518306032965059
Subject: Information Science
Abstract/Summary:
Alumni are among the most important resources in a school's construction and play a special role in its development and heritage. How to discover, integrate, and make good use of alumni resources is therefore an important topic. However, because alumni names are ambiguous, searching the Internet directly for an alumnus's name often returns a large amount of irrelevant information. This thesis therefore applies person name disambiguation to the task of identifying alumni.

First, the thesis studies the hierarchical clustering algorithm. Using bibliometrics and knowledge mapping, it analyzes the research literature on hierarchical clustering from the past two decades in terms of publication trends, the distribution of the literature across disciplines, author collaboration, and research hot spots and frontiers. This analysis provides a basis for the algorithm improvement made in the thesis.

Second, the traditional hierarchical clustering algorithm is improved. Based on the idea of quantiles, a new method for calculating the distance between clusters is proposed: the distance between two clusters is measured by the average of the distances between data points that fall within a quantile interval. This reduces, to some extent, the influence of outliers on the clustering and improves clustering accuracy, which makes the algorithm better suited to the person name disambiguation and alumni identification scenarios of this thesis.

Third, an alumni information recognition model based on the improved hierarchical clustering algorithm is proposed. It consists of four modules: text preprocessing, text keyword extraction, text feature vector generation, and person name disambiguation with alumni recognition. The model first uses the word2vec tool to represent the text and generate word vectors for the web page text. Following the mean word2vec idea, it averages the word vectors of a text's keywords and takes the result as the feature vector of the web page text, which avoids the high dimensionality of traditional text representation models. Feature vectors for the alumni keywords in the alumni information knowledge base are generated in the same way and are used to identify alumni clusters. The model then clusters the web page feature vectors with the improved hierarchical clustering algorithm to obtain the person name disambiguation result, and uses the constructed alumni verification feature vectors to identify alumni information.

The experimental results show that the proposed model can effectively disambiguate web page text and identify alumni information.
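The quantile-based cluster distance described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function names `quantile_linkage` and `agglomerate`, the Euclidean metric, the default interval of 0.25 to 0.75, and the rule of trimming the sorted pairwise distances by rank. It is not the thesis's implementation.

```python
import math
import numpy as np

def quantile_linkage(a, b, q_lo=0.25, q_hi=0.75):
    """Mean of the pairwise distances between clusters a and b whose rank
    falls inside the [q_lo, q_hi] quantile interval (assumed definition)."""
    # All pairwise Euclidean distances between points of a and points of b.
    d = np.sort(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2).ravel())
    n = len(d)
    k_lo = math.floor(q_lo * n)
    # Always keep at least one distance so the mean is defined.
    k_hi = max(math.ceil(q_hi * n), k_lo + 1)
    return float(d[k_lo:k_hi].mean())

def agglomerate(points, n_clusters, q_lo=0.25, q_hi=0.75):
    """Naive bottom-up clustering that merges the closest pair of clusters
    under quantile_linkage until n_clusters remain. Returns integer labels."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = quantile_linkage(points[clusters[i]],
                                        points[clusters[j]], q_lo, q_hi)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = np.empty(len(points), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

With one point at the origin and a second cluster at distances 1, 2, 3, and 100, this sketch keeps only the middle distances 2 and 3, so the far point at 100 does not affect the result. Plain average linkage would be pulled toward the outlier.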
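The mean word2vec feature vector and its use for alumni identification can be sketched in a similar way. This is only an illustration under stated assumptions: the embedding table is a hypothetical toy dictionary standing in for a trained word2vec model, the helper names `mean_keyword_vector` and `cosine` are invented here, and cosine similarity is assumed as the comparison measure, which the abstract does not specify.

```python
import numpy as np

def mean_keyword_vector(keywords, embeddings):
    """Average the word vectors of the keywords found in the embedding table.
    Returns a zero vector when no keyword has an embedding."""
    vecs = [embeddings[w] for w in keywords if w in embeddings]
    if not vecs:
        dim = len(next(iter(embeddings.values())))
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Hypothetical 3-dimensional embeddings; a real model would supply these.
toy_embeddings = {
    "university": np.array([1.0, 0.0, 0.0]),
    "graduate":   np.array([0.8, 0.2, 0.0]),
    "football":   np.array([0.0, 1.0, 0.0]),
    "league":     np.array([0.0, 0.9, 0.1]),
}

# Feature vector built from alumni keywords in the knowledge base (assumed).
alumni_vec = mean_keyword_vector(["university", "graduate"], toy_embeddings)
# Feature vectors of two web pages mentioning the same name.
page_a = mean_keyword_vector(["university", "graduate", "unknown"], toy_embeddings)
page_b = mean_keyword_vector(["football", "league"], toy_embeddings)

# The page closer to the alumni vector is the more likely alumni match.
scores = {"page_a": cosine(page_a, alumni_vec),
          "page_b": cosine(page_b, alumni_vec)}
```

Every text is reduced to a vector of the embedding dimension regardless of vocabulary size, which is the dimensionality advantage the abstract points to. In the described model, vectors like `page_a` and `page_b` would first be clustered and the clusters then compared against the alumni vector.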
Keywords/Search Tags: Person name disambiguation, Hierarchical clustering, Alumni, Word2vec, Text classification