
Research On The Representation Of Scientific Papers

Posted on: 2022-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: R Gu
Full Text: PDF
GTID: 2518306509494304
Subject: Natural language processing
Abstract/Summary:
In natural language processing, embedding is a technique for representing text in a form that computers can process easily. However, most embedding models, whether word-level, sentence-level, or document-level, attend to only a single document, considering at most the relationships between adjacent sentences or between paragraphs; they do not exploit relevance information between documents, which limits their ability to represent document-level text. Recently, models have emerged that embed inter-document information using citations. Rather than feeding the relevance information directly into the neural network alongside the documents, these models group multiple related documents into a single training example, so that highly related documents are mapped to embedding vectors that lie close together in the vector space, while weakly related documents are mapped to vectors that lie far apart.

Although such models achieve good results on many tasks, they still have shortcomings. First, they use too few types of information: current embedding models rely only on citations, yet many other kinds of inter-document relevance information exist. Second, the citation information itself is uneven. For a given article, different references play different roles: some are mentioned only as background, others are the direct source of the article's ideas, and there are also closely related articles that are never cited at all, so the relevance of different references varies greatly.

Based on these two deficiencies, we propose two improvement methods. The first uses author information. This thesis studies scientific literature, i.e., academic papers across fields, a text type that carries many usable kinds of information, including authors and publishers. Following the current citation-based approach, we use shared authorship as an additional basis for judging the relevance of articles, and thereby construct higher-quality training examples with richer information. Experiments based on this idea outperform existing models that use inter-document information on some tasks.

The second method distinguishes different types of citation information, explores their impact on neural network training, and then combines them to improve the model. Here we judge the relevance of articles from various relationships in the citation network, such as indirect citations and co-citations, and use these to construct high-quality training examples. The improvement experiments in this work achieve relatively satisfactory results on multiple tasks.

The experimental results show that both the method that uses author information and the method that exploits differences among citations are effective. This suggests that many kinds of information, including citations and authorship, can have a positive impact on model training as long as the relevance information they carry is used correctly. This improvement work has significance for future research on the analysis of scientific literature and on the improvement of embedding models.
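The training scheme the abstract describes can be sketched as a triplet-style objective: group a query paper with a related (positive) and an unrelated (negative) paper, then penalize the model when the positive is not closer to the query than the negative. The sketch below is a minimal illustration, not the thesis's actual implementation; the toy corpus, the rule "cited or shares an author counts as positive", and all names are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy corpus: paper id -> cited paper ids and author ids.
# All data here is illustrative, not taken from the thesis.
papers = {
    "p1": {"cites": {"p2", "p3"}, "authors": {"a1"}},
    "p2": {"cites": {"p3"},       "authors": {"a1", "a2"}},
    "p3": {"cites": set(),        "authors": {"a3"}},
    "p4": {"cites": set(),        "authors": {"a4"}},
}

def build_triples(papers):
    """Build (query, positive, negative) training triples.

    A positive is a paper the query cites OR one sharing an author
    with the query; a negative is a paper with neither relation."""
    triples = []
    for q, meta in papers.items():
        positives = set(meta["cites"])
        positives |= {p for p, m in papers.items()
                      if p != q and m["authors"] & meta["authors"]}
        negatives = [p for p in papers if p != q and p not in positives]
        for pos in positives:
            for neg in negatives:
                triples.append((q, pos, neg))
    return triples

def triplet_loss(q_vec, pos_vec, neg_vec, margin=1.0):
    """Margin loss: zero once the positive is closer to the query
    than the negative by at least `margin`."""
    d_pos = np.linalg.norm(q_vec - pos_vec)
    d_neg = np.linalg.norm(q_vec - neg_vec)
    return max(0.0, d_pos - d_neg + margin)
```

Minimizing this loss over many such triples is what pushes related papers toward nearby embedding vectors and unrelated papers apart; the thesis's contribution is in how the positives are chosen (shared authors, indirect citations, co-citations) rather than in the loss itself.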
Keywords/Search Tags: embedding model, scientific papers, inter-document relevance information