Research And Application On Disambiguating Authors

Posted on: 2021-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: N Li
Full Text: PDF
GTID: 2428330620468179
Subject: Software engineering
Abstract/Summary:
The steady advance of informatization has accelerated the construction of digital libraries, which has greatly facilitated people's study and work. However, the rapid development of digital libraries has also brought data fragmentation, resulting in low data quality and poor data availability. Author ambiguity, in which different authors share the same name, is one of the typical problems in digital libraries, and it seriously affects their content quality and service experience. Author disambiguation aims to identify the distinct authors who share a name, along with the papers each has published. Because the data are massive, of low quality, and interdependent, disambiguating authors is a challenging task.

The current mainstream methods may be suboptimal because their features are weakly expressive and because they introduce low-quality relationships, so there is considerable room to improve the performance of author disambiguation. This thesis achieves better performance by improving feature expressiveness and by reducing the negative impact of low-quality relationships between authors. The main contributions are as follows:

· Disambiguating authors based on the fusion of multi-type features. To overcome weak feature expressiveness and the low-quality relationships introduced by undisambiguated collaborators, we propose an author disambiguation method named CMFAD, which integrates both implicit and explicit features. First, CMFAD designs a classifier that integrates multi-type features to predict the probability that two papers belong to the same author. The feature set used to train the classifier contains both implicit and explicit features: the implicit features are learned by models that capture the semantics of paper titles and collaborative relationships, and the explicit features are extracted manually. Then, CMFAD proposes a probabilistic reasoning mechanism to resolve conflicts among the classification results.

· Disambiguating authors in an incremental and unsupervised manner. Considering that the current mainstream methods capture low-quality relationships and ignore higher-order collaborative relationships, we treat author disambiguation as the reconstruction of a collaboration network and propose an incremental, two-stage, unsupervised author disambiguation method named IUAD. In the first stage, IUAD analyzes the effect of frequent collaborative relations and then mines these relations to build a stable collaboration network, which takes full advantage of the higher-order collaborative information. In the second stage, IUAD designs a probabilistic generative model that uses the exponential family of distributions to integrate collaboration network topology, research interests, and research communities, which improves recall. In addition, IUAD does not need to retrain the model for newly published papers and can disambiguate them incrementally.

· Optimizing the author disambiguation method based on labeled data. To further reduce the time IUAD spends constructing the global collaboration network, we propose IIUAD, which optimizes IUAD by introducing some labeled data. IIUAD makes full use of high-precision rules and labeled data to prune candidate pairs more efficiently, which further improves the efficiency of the method.
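The first stage of IUAD, as described above, mines frequent collaborative relations to build a stable collaboration network. The following is a minimal sketch of that idea only, not the thesis's actual algorithm: the function names, the `min_support` threshold, the dictionary layout of `papers`, and the sample names are all assumptions made for illustration. It counts how often each pair of names co-occurs, keeps pairs seen at least `min_support` times, and merges papers of an ambiguous name that share such a frequent co-author. As a simplification, co-author names are treated as if they were already unambiguous.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(papers, min_support=2):
    """Count co-occurrences of each name pair and keep the frequent ones."""
    counts = Counter()
    for authors in papers.values():
        for pair in combinations(sorted(set(authors)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_support}


def cluster_by_frequent_collaborators(papers, target, min_support=2):
    """Group papers of the ambiguous name `target` that share a frequent co-author."""
    frequent = frequent_pairs(papers, min_support)
    # Union-find over the papers that list the ambiguous name.
    parent = {pid: pid for pid, authors in papers.items() if target in authors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # frequent co-author -> first paper seen with that co-author
    for pid in parent:
        for name in papers[pid]:
            if name == target:
                continue
            if tuple(sorted((target, name))) not in frequent:
                continue  # infrequent collaboration: treated as unreliable
            if name in first_seen:
                parent[find(pid)] = find(first_seen[name])
            else:
                first_seen[name] = pid

    clusters = {}
    for pid in parent:
        clusters.setdefault(find(pid), set()).add(pid)
    return sorted(clusters.values(), key=sorted)


# Hypothetical toy data: paper id -> list of author names.
papers = {
    "p1": ["W Wang", "A Liu", "B Chen"],
    "p2": ["W Wang", "A Liu"],
    "p3": ["W Wang", "C Zhao"],
    "p4": ["W Wang", "C Zhao", "D Sun"],
    "p5": ["W Wang", "E Qian"],
}
clusters = cluster_by_frequent_collaborators(papers, "W Wang")
print(clusters)
```

In this toy data, papers p1 and p2 are grouped through the repeated co-author "A Liu", p3 and p4 through "C Zhao", and p5 stays on its own because its only collaboration occurs once. Leaving such papers unmerged favors precision; per the abstract, it is the second stage's generative model that improves recall.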
Keywords/Search Tags: Author Disambiguation, Implicit Features, Explicit Features, Collaboration Network, Probabilistic Generative Model