
Design And Implementation Of Author Name Disambiguation System Based On Two Step Clustering

Posted on: 2021-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: S Hu
Full Text: PDF
GTID: 2518306557492564
Subject: Software engineering
Abstract/Summary:
The scientific and technological literature data of the industry-university-research data service platform comes from Internet literature knowledge bases, and processing this data means dealing with author name ambiguity. In a literature database that identifies authors mainly by name, scholars who share the same name are common, so it is often impossible to determine who actually wrote a given document. In industry-university-research cooperation, staff and enterprises search the scientific and technological literature to find matching experts and scholars, and ambiguous author names seriously reduce the accuracy of these searches. A data cleaning tool that can effectively resolve name ambiguity and accurately attribute each document to its author therefore has important application value.

This thesis designs and implements a disambiguation system for authors who share the same name. The system processes the author and document information collected by a web crawler, extracts document features, divides the documents into clusters, and then links the document clusters to scholar entities. After integrating literature and scholar data from different data sources, the system presents the data through a Web application. The main work of this thesis is as follows:

(1) This thesis proposes TSC, a step-by-step clustering algorithm for disambiguating authors with the same name. In the first step, a collaboration graph of the authors to be disambiguated is constructed, author similarity is calculated from path parameters, and an initial clustering is produced. In the second step, a word vector model is trained and used to obtain a vector representation of each document's text, text similarity is calculated, and the second clustering step is completed using that similarity. Each final document cluster contains the documents belonging to one distinct author. Comparative experiments show that the step-by-step clustering algorithm achieves better overall accuracy and recall.

(2) For Chinese papers, English papers, patents, and other literature data from different data sources, different strategies are used to integrate expert and scholar entities with this multi-source heterogeneous literature data, so that all scientific and technological documents under an expert's name can be retrieved. The experimental results show that the error rate of data integration is generally low, and that the links between documents and authors are accurate enough to be usable.

(3) A disambiguation system for authors with the same name is designed and implemented. It mainly consists of a disambiguation clustering module, a multi-source heterogeneous data integration module, and a data visualization module. Functional tests of the features provided by the system verify that all functional modules operate normally.
Keywords/Search Tags: Name Disambiguation, Text Vector, Similarity Calculation, Text Clustering, Information Integration
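
The two-step idea in item (1) of the abstract can be illustrated with a small sketch. This is not the thesis's TSC implementation, only a simplified stand-in under stated assumptions: step 1 links papers that directly share a co-author (the thesis computes author similarity from path parameters on a collaboration graph), and step 2 compares bag-of-words vectors with cosine similarity (the thesis uses a trained word vector model). The toy records, the function names, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy records: every paper lists the same ambiguous author name.
PAPERS = {
    "p1": {"coauthors": {"L Wang", "J Chen"}, "text": "author name disambiguation clustering"},
    "p2": {"coauthors": {"J Chen"}, "text": "graph clustering for name disambiguation"},
    "p3": {"coauthors": {"M Zhao"}, "text": "protein folding molecular simulation"},
    "p4": {"coauthors": set(), "text": "molecular simulation of protein folding"},
    "p5": {"coauthors": {"Q Li"}, "text": "wireless sensor network routing"},
}


def coauthor_clusters(papers):
    """Step 1 (simplified): group papers that share at least one co-author."""
    parent = {pid: pid for pid in papers}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(papers, 2):
        if papers[a]["coauthors"] & papers[b]["coauthors"]:
            parent[find(a)] = find(b)

    groups = {}
    for pid in papers:
        groups.setdefault(find(pid), set()).add(pid)
    return list(groups.values())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster_vector(cluster, papers):
    """Bag-of-words vector for a cluster (assumed stand-in for word vectors)."""
    vec = Counter()
    for pid in cluster:
        vec.update(papers[pid]["text"].lower().split())
    return vec


def merge_by_text(clusters, papers, threshold=0.5):
    """Step 2: repeatedly merge the most similar cluster pair above the threshold."""
    clusters = [set(c) for c in clusters]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sim = cosine(cluster_vector(clusters[i], papers),
                         cluster_vector(clusters[j], papers))
            if sim >= threshold and (best is None or sim > best[0]):
                best = (sim, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because j > i


def two_step_clustering(papers, threshold=0.5):
    return merge_by_text(coauthor_clusters(papers), papers, threshold)


if __name__ == "__main__":
    for cluster in two_step_clustering(PAPERS):
        print(sorted(cluster))
```

On the toy data, step 1 joins p1 and p2 through their shared co-author and leaves the rest as singletons. Step 2 then merges p3 and p4 because their text is similar, even though they share no co-author. That is the case the second step is meant to cover: documents by the same person with no overlapping collaborators.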