
Research on the Method of Author-Paper Identification

Posted on: 2022-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: P B Gao
Full Text: PDF
GTID: 2518306350495454
Subject: Computer technology
Abstract/Summary:
As natural language processing and data mining technologies mature, automatic identification techniques are becoming increasingly reliable. Author-paper identification can be automated with machine learning, so that a model determines the correspondence between an author and a paper and matches each paper to its author from the paper's information. Automating this task meets the day-to-day needs of staff in scientific research management institutions while greatly reducing labor costs and improving efficiency. Manual identification cannot match these advantages, so it is worthwhile to let automatic identification take over part of the manual work.

Automatic author-paper identification is a binary classification problem defined over two main data sets: papers and authors. Classification is difficult because different authors may share the same name and a single author may appear under several name variants, both of which lead to classification errors. This thesis addresses two topics: the design and extraction of the features required for classification, and the fusion of the trained classifiers into an ensemble model.

In feature design and extraction, co-author similarity is the most important feature for automatic author identification. The traditional co-author similarity feature takes, for each author, the top k most frequent collaborators as that author's co-authors. This thesis instead proposes a co-author network and uses its structure to disambiguate authors who share a name, which works much better than the traditional co-author statistics: after disambiguation the number of co-authors in the network is reduced by nearly 50%, the accuracy reaches 94.82%, and the F1 value reaches 96.08%. Three new features are also designed: string distance, journal and conference correlation, and keyword information. These three features differ considerably in nature, so they describe the author-paper match from other angles and better capture the degree of correlation between an author and a paper.

Multi-classifier systems and ensemble learning are an important part of classification tasks. This thesis fuses three machine learning models, AdaBoost, random forest, and XGBoost, using the voting method, a combination strategy from ensemble learning that is well suited to classification tasks. A set of parameters is tuned through k-fold cross-validation and then used to integrate the base classifiers. The resulting fusion model raises the accuracy to 97.14%, with an F1 value of 96.71%.

In summary, this thesis presents an improved method for automatic author-paper identification. Its main contributions lie in the construction of new features and in classifier fusion; improvements in these two areas raise the final classification performance of the experimental method.
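As a rough illustration of two ingredients named in the abstract, the string distance feature and the voting strategy, the following is a minimal sketch in plain Python. The abstract does not say which distance metric or which voting variant the thesis uses, so Levenshtein distance and hard majority voting are assumptions here, and the function names `levenshtein`, `name_similarity`, and `majority_vote` are hypothetical.

```python
from collections import Counter
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (assumed metric, not confirmed by the thesis)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def name_similarity(name_a: str, name_b: str) -> float:
    """Rescale edit distance to [0, 1]; 1.0 means identical after lowercasing and trimming."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def majority_vote(predictions: Sequence[int]) -> int:
    """Hard voting: the label predicted by the most base classifiers.

    With three base classifiers and two classes a tie cannot occur;
    in other settings the first label to reach the top count wins.
    """
    return Counter(predictions).most_common(1)[0][0]
```

For example, `name_similarity("P B Gao", "p b gao")` scores two spellings of one name as identical, and `majority_vote([1, 0, 1])` returns the label chosen by two of three hypothetical base classifiers. In practice the base predictions would come from trained AdaBoost, random forest, and XGBoost models rather than hand-written lists.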
Keywords/Search Tags: Author Identification, Co-author Relationship, Feature Construction, Similarity Calculation, Classifier