Research On Name Disambiguation Method For Author Retrieval Of Sci-tech Literature

Posted on:2022-09-21

Degree:Master

Type:Thesis

Country:China

Candidate:S S Wang

GTID:2518306476498664

Subject:Electronics and Communications Engineering

Abstract/Summary:

As the volume of scientific and technical literature and the demand for retrieval grow, authors who share the same name increasingly degrade the quality of document retrieval. Name disambiguation is therefore a key problem to solve when building a literature knowledge base. To further improve disambiguation accuracy, this thesis makes full use of the available data features and proposes an improved two-stage clustering disambiguation method based on the fusion of semantic features and graph relationship features. The method adds IDF weighting, triplet loss learning, a self-defined random walk probability over the network, and other refinements to mine the feature information more fully, and it achieves good disambiguation results on the evaluation data set. The research work is as follows.

First, the thesis studies how to build a standard data set for disambiguation, gives a detailed extraction and construction process, and statistically analyzes the AMiner data set used for the later improvements to the disambiguation method. The analysis of its attribute characteristics shows that every attribute feature contains many low-frequency components that rules cannot distinguish effectively, which motivates the subsequent improvements.

Second, an improved name disambiguation method based on text semantic feature embedding is proposed. The embedding vectors are IDF-weighted and adjusted by a triplet loss model, and a document semantic distance matrix is computed from them. A two-stage clustering strategy is then proposed: the first stage pre-clusters with the DBSCAN algorithm, and the second stage handles outliers with algorithms such as maximum similarity matching to complete the disambiguation. The evaluation shows that the improvement is effective: macro-averaged F1 rises from 0.38 for a single semantic embedding to 0.47.

Third, building on the text semantic feature embedding, an improved name disambiguation method based on the fusion of semantic features and graph relationship features is proposed. A graph network model is introduced, a node jump probability function is used to generate a set of random walk paths, and the paths are embedded in a vector space to compute a document relationship distance matrix. A variable-step search algorithm fuses these relationship features with the document semantic vectors to obtain the final feature distance matrix, and the two-stage clustering algorithm is applied to that matrix for the final name disambiguation. Experiments show that adding graph network embedding and feature fusion improves evaluation accuracy: F1 rises from 0.47 with semantic features alone to 0.71.

Finally, engineering solutions and application cases for name disambiguation are presented. Optimizations for running the algorithm on large volumes of data are proposed, and the algorithm is applied to the Elsevier paper library to produce a disambiguated expert library. The expert library is then used to discuss applications in web search results and graph analysis.
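
The two-stage clustering described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation, and several parts are assumptions made for illustration only: the function names (fuse, dbscan, assign_outliers), the fixed fusion weight alpha (the thesis reports a variable-step search, which is not reproduced here), the eps, min_pts, and threshold parameters, and the use of smallest mean distance as a stand-in for maximum similarity matching. The sketch uses only the Python standard library and works on precomputed distance matrices.

```python
def fuse(sem, rel, alpha):
    """Weighted sum of a semantic and a relationship distance matrix.
    alpha is a hypothetical fixed weight, not the thesis's searched value."""
    n = len(sem)
    return [[alpha * sem[i][j] + (1 - alpha) * rel[i][j] for j in range(n)]
            for i in range(n)]


def dbscan(dist, eps, min_pts):
    """Stage one: minimal DBSCAN on a precomputed distance matrix.
    Returns one label per document; -1 marks an outlier."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former outlier joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = [k for k in range(n) if dist[j][k] <= eps]
            if len(nbrs) >= min_pts:  # only core points expand the cluster
                queue.extend(nbrs)
    return labels


def assign_outliers(dist, labels, threshold):
    """Stage two: attach each outlier to the stage-one cluster with the
    smallest mean distance, or open a new singleton cluster if no cluster
    is within threshold."""
    labels = list(labels)
    next_id = max(labels, default=-1) + 1
    clusters = {}
    for idx, lab in enumerate(labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(idx)
    for i in range(len(labels)):
        if labels[i] != -1:
            continue
        best, best_d = None, threshold
        for c, members in clusters.items():
            d = sum(dist[i][m] for m in members) / len(members)
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            best = next_id
            next_id += 1
        labels[i] = best
    return labels


if __name__ == "__main__":
    # Toy data: five documents placed on a line, distances are absolute gaps.
    points = [0.0, 0.1, 0.2, 0.5, 5.0]
    dist = [[abs(a - b) for b in points] for a in points]
    stage_one = dbscan(dist, eps=0.15, min_pts=3)
    print(stage_one)
    print(assign_outliers(dist, stage_one, threshold=0.5))
```

In this sketch, outliers are compared only against the clusters found in stage one, which keeps the second stage simple; a fuller version might also compare outliers with each other before opening new clusters.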

Keywords/Search Tags:

name disambiguation, IDF weighting, random walk, feature fusion, two-stage clustering

Related items

1	Research And Application Of The Chinese Organization Names Recognition And Disambiguation
2	Research On Complex Network Clustering Algorithm Based On Random Walk
3	Research On Disambiguation Of Same Authors In Academic Collaboration Network
4	Research And Implementation On A Hybrid Random Walk Algorithm For 3-D Thermal Analysis Of Integrated Circuits
5	Dynamic Agent Based Clustering Algorithms And Their Quantization
6	Random Walk Learning On Graph
7	Disambiguation Of Homonymous Scholars Based On Semantic And Paper Relation Knowledge Graph
8	A Research Of Frame Disambiguation Based On SVM And CRF Model
9	Bibliometric Analysis And Name Disambiguation Research Based On Knowledge Clustering
10	Research On Fuzzy Clustering Algorithm Of Sample And Feature Weighting