Design And Implementation Of Author Name Disambiguation System Based On Two Step Clustering

Posted on:2021-10-12

Degree:Master

Type:Thesis

Country:China

Candidate:S Hu

Full Text:PDF

GTID:2518306557492564

Subject:Software engineering

Abstract/Summary:

The scientific and technological literature data on the industry-university-research data service platform comes from Internet literature knowledge bases, and processing this data runs into the problem of author name ambiguity. Author name ambiguity arises because literature databases identify authors mainly by name, and since many scholars share the same name, it is often impossible to determine which person wrote a given document. In advancing industry-university-research cooperation, staff and enterprises search the scientific and technological literature to find the corresponding experts and scholars, and ambiguous author names seriously reduce the accuracy of that search. A data cleaning tool that effectively eliminates name ambiguity and accurately determines the authorship of a document therefore has important application value. This thesis designs and implements a disambiguation system for same-name authors in the literature. The system processes the author and document information collected by a web crawler, extracts document features, divides the documents into clusters, and then links the document clusters to scholar entities. After integrating literature and scholar data from different data sources, the system visualizes the data through a Web application. The main work of this thesis is as follows:

(1) This thesis proposes a TSC algorithm based on step-by-step clustering to disambiguate authors with the same name. The algorithm first constructs a collaboration graph of the authors to be disambiguated, calculates author similarity from path parameters, and completes the first clustering step. It then trains a word vector model, uses the model to obtain vector representations of the document text, calculates text similarity, and completes the second clustering step based on that similarity. Each final document cluster holds the documents belonging to one distinct author. Comparative experiments show that the step-by-step clustering algorithm achieves better overall accuracy and recall.

(2) For Chinese papers, English papers, patents, and other literature data from different data sources, different strategies are used to integrate expert and scholar entities with this multi-source heterogeneous literature data, so that all the scientific and technological documents under an expert's name can be retrieved. The experimental results show that the error rate of data integration is generally low, and that the linking between documents and authors is accurate enough to be usable.

(3) A disambiguation system for authors with the same name is designed and implemented. It consists mainly of a disambiguation clustering module, a multi-source heterogeneous data integration module, and a data visualization module. Functional tests of the functions provided by the system verify that all of its functional modules operate normally.
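The two-step idea in item (1) can be illustrated with a minimal sketch. It is not the thesis's TSC implementation, and it rests on simplifying assumptions: the path-based author similarity is reduced to "two papers share at least one co-author", and the word vector model is swapped for plain bag-of-words count vectors compared by cosine similarity. The function name two_step_cluster, the paper fields coauthors and text, the threshold text_threshold, and the sample records are all hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def find(parent, i):
    """Return the cluster representative of paper i (union-find with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the clusters that contain papers a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def two_step_cluster(papers, text_threshold=0.5):
    """Cluster the papers published under one ambiguous author name.

    Assumption: each paper is a dict with 'coauthors' (list of names,
    excluding the ambiguous name itself) and 'text' (title or abstract).
    """
    n = len(papers)
    parent = list(range(n))

    # Step 1 (simplified): papers that share a co-author are assumed to
    # belong to the same person and are merged.
    for i, j in combinations(range(n), 2):
        if set(papers[i]["coauthors"]) & set(papers[j]["coauthors"]):
            union(parent, i, j)

    def cluster_vectors():
        # One bag-of-words vector per current cluster (a stand-in for
        # the word-vector text representation described in the abstract).
        vecs = {}
        for i, paper in enumerate(papers):
            root = find(parent, i)
            vecs.setdefault(root, Counter()).update(paper["text"].lower().split())
        return vecs

    # Step 2: repeatedly merge clusters whose text vectors are similar enough.
    merged = True
    while merged:
        merged = False
        vecs = cluster_vectors()
        for a, b in combinations(sorted(vecs), 2):
            if cosine(vecs[a], vecs[b]) >= text_threshold:
                union(parent, a, b)
                merged = True
                break  # vectors are stale after a merge, so recompute

    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


# Hypothetical records, all published under the same ambiguous author name.
papers = [
    {"coauthors": ["Li Na"], "text": "graph neural network node embedding"},
    {"coauthors": ["Li Na", "Chen Bo"], "text": "graph embedding link prediction"},
    {"coauthors": [], "text": "node embedding graph neural network training"},
    {"coauthors": ["Zhao Min"], "text": "soil nitrogen crop yield"},
]

print(two_step_cluster(papers))
```

In this sketch the first step only merges papers with direct co-author evidence, and the second step is what allows a paper with no shared co-authors to join a cluster on the strength of its text. Raising text_threshold makes the second step more conservative.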

Keywords/Search Tags:

Name Disambiguation, Text Vector, Similarity Calculation, Text Clustering, Information Integration

Related items

1	Research On The Calculation Method Of Han-Thai Bilingual News Text Similarity With News Elements
2	Study On Similarity-based Text Clustering Algorithm And It's Application
3	Study On Similarity-based Text Clustering Algorithm And Its Application
4	Research On Semantic Similarity Calculation Of Chinese Short Text
5	Chinese Text Clustering Based On Text Similarity
6	Text Similarity Computing Theory And Applied Research
7	Research On Text Sentiment Clustering Method Based On Dimension Identification
8	Research And Implementation Of The Text Cluster Based On Text Similarity Caculation
9	Semantic Similarity Calculation Text Field Vector Space Model
10	Study On Text Clustering Based On Topic Sentence Vector Model