Automatic Identification Of Cited Spans And Classification Of Citation Type In Academic Articles

Posted on: 2021-12-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Xu
Full Text: PDF
GTID: 2518306512488334
Subject: Information Science
Abstract/Summary:
Because citation context captures significant content of the reference paper, such as its research methods, conclusions and limitations, it has been widely applied to automatic summarization. However, citing authors with different motivations describe the reference paper differently, which makes it hard for citation context to reflect the reference paper comprehensively and accurately. In recent years, several shared tasks, such as the CL-SciSumm Shared Task and the Text Analysis Conference (TAC) 2014, have proposed structured summary generation based on cited text spans (CTS) in scientific literature. CTS refers to the content in the reference paper that best reflects the citation context or the cited object. Because it is extracted from the reference paper itself, a CTS-based summary can reflect the content of the reference paper more accurately and effectively than a summary based on citation context.

In the CTS-based structured summarization framework, CTS are first identified according to the citation context, then classified by the citation type that links the CTS and the citation context, and finally compressed to obtain a structured summary of the reference paper. Because the identification and classification of CTS directly affect the quality of the generated summary, this paper addresses the automatic identification of CTS and the classification of citation type based on CTS.

The automatic identification of CTS consists of two steps: handling the imbalanced dataset and constructing the CTS identification model. This paper treats CTS identification as a binary classification problem. Because the categories in the training data differ greatly in sample size, negative sampling is used to alleviate the imbalance. Based on a correlation analysis of the CTS features, representative negative samples are extracted by calculating the similarity between sample points, which both preserves the relative completeness of the information and narrows the gap between the numbers of positive and negative samples. Then, after feature selection for each base classifier, ensemble models are constructed from three perspectives: a voting-based scheme, a classifier-weight-based scheme and an ensemble-algorithm-based scheme. The experimental results show that negative sampling and feature selection improve the results of the base classifiers, and that the ensemble schemes improve on the base classifiers.

Citation type classification is studied from three perspectives: a rule-based method, an Attention-based Bi-LSTM model and a Labeled LDA model. Because the volume of experimental data is limited, this paper first constructs a trigger lexicon for each citation type using manually constructed rules. In the Attention-based Bi-LSTM model and the Labeled LDA model, citation type classification is performed by supervised machine learning. The experimental results show that the Attention-based Bi-LSTM model achieves the best overall results on the five citation types.

By studying the automatic identification of CTS and the classification of citation type, this paper provides new perspectives for related research such as the recognition of citation motivation, academic evaluation and automatic summarization.
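
The abstract gives no implementation details for the similarity-based negative sampling or the voting-based ensemble, so the following is only a minimal sketch under stated assumptions. It assumes cosine similarity over numeric feature vectors, a greedy rule that drops a negative sample when it is nearly identical to one already kept, and simple majority voting over base-classifier predictions. The function names and the 0.95 threshold are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def select_representative_negatives(negatives, threshold=0.95):
    """Greedy undersampling (assumed rule, not the thesis's exact method).

    A negative sample is kept only if its similarity to every sample
    already kept is below `threshold`, so near-duplicates are dropped
    while dissimilar samples survive.
    """
    kept = []
    for vec in negatives:
        if all(cosine(vec, k) < threshold for k in kept):
            kept.append(vec)
    return kept


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    `predictions` holds one list of labels per base classifier, all of
    the same length. An odd number of classifiers avoids ties for
    binary labels.
    """
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


negatives = [[1, 0], [2, 0], [0, 1], [0, 3], [1, 1]]
sampled = select_representative_negatives(negatives)

base_predictions = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
ensemble_labels = majority_vote(base_predictions)
```

In this toy input, `[2, 0]` and `[0, 3]` point in the same direction as samples already kept, so they are dropped and the negative class shrinks without losing any distinct direction in feature space. A weighted scheme of the kind the abstract mentions could replace the plain count with per-classifier weights.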
Keywords/Search Tags: Cited Text Spans, Citation Type, Text Classification, Negative Sampling, Ensemble Model