Research On Name Disambiguation Method For Author Retrieval Of Sci-tech Literature

Posted on: 2022-09-21  Degree: Master  Type: Thesis
Country: China  Candidate: S S Wang  Full Text: PDF
GTID: 2518306476498664  Subject: Electronics and Communications Engineering
Abstract/Summary:
As the volume of scientific and technical literature and the demand for retrieval grow, authors who share the same name increasingly degrade retrieval quality. Name disambiguation is therefore a key problem in building a literature knowledge base. To improve disambiguation accuracy, this thesis makes full use of data features and proposes an improved two-stage clustering method based on the fusion of semantic features and graph relationship features. The method adds IDF weighting, triplet loss learning, a custom random-walk probability over the network, and other improvements to mine the feature information in the data, and it achieves good disambiguation results on the evaluation data set. The research work is as follows.

First, the construction of a standard data set for disambiguation is studied. A detailed extraction and construction process is given, and the Aminer data set used for the later improvements is statistically analyzed. The analysis of its attribute features shows that each attribute contains many low-frequency values that rule-based methods cannot distinguish effectively, which motivates the subsequent improvements.

Second, an improved name disambiguation method based on text semantic feature embedding is proposed. The embedding vectors are weighted by IDF and adjusted by a triplet loss model, and a document semantic distance matrix is computed from them. A two-stage clustering strategy is then proposed: the first stage pre-clusters with the DBSCAN algorithm, and the second stage handles outliers with algorithms such as maximum similarity matching to complete the disambiguation. The evaluation shows that the improvement is effective: macro-averaged F1 increases from 0.38 for single semantic embedding to 0.47.

Third, building on text semantic feature embedding, an improved name disambiguation method based on the fusion of semantic features and graph relationship features is proposed. A graph network model is introduced; a node jump probability function generates a set of random walk paths, which are embedded in a vector space to compute a document relationship distance matrix. A variable-step search algorithm combines the document semantic vectors with feature fusion to obtain the final feature distance matrix, and the two-stage clustering algorithm is applied to this matrix to perform the final name disambiguation. Experimental results show that adding graph network embedding and feature fusion improves accuracy: F1 increases from 0.47, obtained with semantic features alone, to 0.71.

Finally, engineering solutions and application cases for name disambiguation are given. Optimizations for running the algorithm on large volumes of data are proposed, and the algorithm is applied to the Elsevier paper library to build a disambiguated expert library. The expert library is then used to discuss web search results and graph analysis applications.
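The two-stage clustering over a fused distance matrix described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The fixed fusion weight `w`, the thresholds `eps`, `min_pts` and `max_dist`, the use of mean distance as the "maximum similarity" criterion, and the toy matrices are all assumptions made for illustration; the abstract says only that a variable-step search is involved in obtaining the fused matrix and that outliers are handled with algorithms such as maximum similarity matching.

```python
from collections import deque

def fuse(sem, rel, w):
    """Weighted sum of semantic and relationship distance matrices.
    The fixed weight w is an assumption, not the thesis's search procedure."""
    n = len(sem)
    return [[w * sem[i][j] + (1 - w) * rel[i][j] for j in range(n)]
            for i in range(n)]

def dbscan(dist, eps, min_pts):
    """Stage one: minimal DBSCAN on a precomputed distance matrix.
    Returns one label per document; -1 marks an outlier."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional outlier
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former outlier becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = [k for k in range(n) if dist[j][k] <= eps]
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is a core point, keep expanding
    return labels

def assign_outliers(dist, labels, max_dist):
    """Stage two: attach each outlier to the stage-one cluster with the
    smallest mean distance (highest similarity), if within max_dist;
    otherwise give it a new singleton cluster."""
    labels = list(labels)
    clusters = {}
    for i, lab in enumerate(labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(i)
    next_id = max(labels) + 1
    for i in range(len(labels)):
        if labels[i] != -1:
            continue
        best, best_d = None, max_dist
        for lab, members in clusters.items():
            d = sum(dist[i][m] for m in members) / len(members)
            if d <= best_d:
                best, best_d = lab, d
        if best is None:
            labels[i] = next_id
            next_id += 1
        else:
            labels[i] = best
    return labels

# Toy data: documents 0-2 are close, 3 is moderately close, 4 is far away.
sem = [[0.0, 0.2, 0.2, 0.6, 1.0],
       [0.2, 0.0, 0.2, 0.6, 1.0],
       [0.2, 0.2, 0.0, 0.6, 1.0],
       [0.6, 0.6, 0.6, 0.0, 1.0],
       [1.0, 1.0, 1.0, 1.0, 0.0]]
rel = [[0.0, 0.1, 0.1, 0.2, 0.8],
       [0.1, 0.0, 0.1, 0.2, 0.8],
       [0.1, 0.1, 0.0, 0.2, 0.8],
       [0.2, 0.2, 0.2, 0.0, 0.8],
       [0.8, 0.8, 0.8, 0.8, 0.0]]

dist = fuse(sem, rel, w=0.5)
stage_one = dbscan(dist, eps=0.25, min_pts=3)
labels = assign_outliers(dist, stage_one, max_dist=0.5)
print(stage_one, labels)
```

In this toy run, stage one leaves documents 3 and 4 as outliers. Stage two then merges document 3 into the existing cluster and gives document 4 its own cluster, which corresponds to treating it as a different author with the same name.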
Keywords/Search Tags: name disambiguation, IDF weighting, random walk, feature fusion, two-stage clustering