With the development of information technology, the scale of information and the ways it is stored and acquired have changed greatly, and a variety of academic search engines have appeared. These search engines have become the main way for scholars to find papers. Although they bring great convenience, papers by different authors who share the same name are still often not attributed to the correct author, which makes retrieval by author name less accurate. In recent years many scholars have studied author name disambiguation, but problems remain, such as underuse of the information contained in papers and neglect of newly added papers. This paper studies author name disambiguation from two directions: incremental disambiguation, which handles newly added papers, and fuller use of paper information. The main work is as follows.

First, this paper proposes a feature extraction method that combines the XLNet pre-trained model with manually defined rules, to address the insufficient use of paper information. The method uses manually defined features to extract information from fields such as the author's name and institution, uses XLNet to extract information from fields such as the title and abstract, and then uses XGBoost with the extracted features to predict the author to whom each paper should be assigned. Comparative experiments on the constructed dataset show that the proposed framework outperforms the compared methods in incremental disambiguation.

Second, this paper proposes a cold-start disambiguation method based on agglomerative hierarchical clustering, to address the problem that incremental disambiguation cannot assign every paper. The method runs after incremental disambiguation and post-processes the papers that it leaves unassigned. It performs agglomerative clustering on the unassigned papers, then uses incremental disambiguation to add further papers to the main cluster, and treats the resulting main cluster as a new author. Comparative experiments on the constructed dataset show that the proposed cold-start disambiguation framework improves the final incremental disambiguation results.

Finally, this paper combines the AMiner dataset with DBLP to construct a new dataset for its experiments. The experimental results demonstrate the feasibility of the proposed incremental disambiguation algorithm.
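
As an illustration of the cold-start step described above, the following is a minimal sketch of agglomerative clustering over papers that incremental disambiguation left unassigned. It is not the implementation used in this paper: the average-linkage criterion, the cosine distance, the stopping threshold of 0.3, the function names, and the three-dimensional feature vectors are all assumptions chosen for the example. The sketch covers only the clustering of unassigned papers, not the later step of adding papers to the main cluster.

```python
from math import sqrt
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; a zero vector is treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_clusters(
    vectors: List[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Average-linkage agglomerative clustering (assumed linkage).

    Starts with one cluster per paper and repeatedly merges the two
    closest clusters until the closest pair is farther apart than
    `threshold`. Returns clusters as lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None  # (average distance, index of cluster i, index of cluster j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [
                    cosine_distance(vectors[p], vectors[q])
                    for p in clusters[i]
                    for q in clusters[j]
                ]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        if best[0] > threshold:
            break  # no remaining pair is similar enough to merge
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical feature vectors for papers left unassigned by the
# incremental step (illustrative values only, not real data).
unassigned = {
    "p1": [1.0, 0.0, 0.1],
    "p2": [0.9, 0.1, 0.0],
    "p3": [0.0, 1.0, 0.0],
    "p4": [0.1, 0.9, 0.1],
    "p5": [0.0, 0.0, 1.0],
}
ids = list(unassigned)
groups = agglomerative_clusters([unassigned[i] for i in ids], threshold=0.3)

# Each resulting group is a candidate new author profile.
new_authors = [sorted(ids[k] for k in group) for group in groups]
print(new_authors)
```

In this toy input, papers with similar vectors merge into one group and the dissimilar paper stays alone. In practice the threshold would need tuning on held-out data, since too low a value fragments one author into many profiles and too high a value merges distinct authors.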