Research And Implementation Of Automatic Extractive Summarization On Medical Papers

Posted on: 2024-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Qi
GTID: 2568306914982649
Subject: Intelligent Science and Technology
Abstract/Summary:
In natural language processing, automatic summarization compresses long documents and extracts their important information to support downstream information retrieval and storage. Among summarization tasks, long document summarization has become a research hotspot because of its long inputs, difficult semantic analysis, and low compression rate. Academic papers are the main representative of long documents, and the number of medical papers has grown rapidly in recent years following the emergence of coronavirus disease 2019 (COVID-19). Although existing extractive summarization techniques for long papers have made some progress, the following problems remain: (1) The latest extractive methods tend to adopt attention mechanisms based on sequential relationships and ignore the section relationships of papers, so the sentences they select are drawn disproportionately from the head of the document. (2) The specialized domain knowledge of the papers is not considered, so content understanding is limited, which to some extent prevents the system from achieving better results.

To address these problems, this thesis presents an in-depth study of extractive summarization for medical papers. First, it analyzes the data distribution, optimizes the automatic label-construction algorithm, and statistically observes the structural correlation between the summary and the source document. Next, based on this structural distribution and on medical entity knowledge, it designs a heterogeneous graph built on the explicit writing structure and the implicit knowledge structure of the source document. The graph contains sentence nodes, entity nodes, and section nodes, together with semantically rich connecting edges, so it can capture cross-sentence and cross-section logic while analyzing intra-sentence semantic features, and thereby generate more comprehensive summaries. This thesis also provides CORD-SUM, a large-scale academic paper dataset whose main research topic is coronavirus.

Experimental results on CORD-SUM show that, compared with previous work, the proposed method, SAPGraph, generates more comprehensive summaries with higher similarity scores to the reference. SAPGraph also achieves better results on arXiv, another academic paper dataset that covers multiple domains. In addition, this thesis provides a demo system that performs long paper summarization in real time and presents the graph modeling structure. In conclusion, this thesis constructs a structure- and knowledge-aware heterogeneous graph to improve comprehensiveness and accuracy in extractive summarization, and obtains effective automatically produced summaries of long medical papers.
Keywords/Search Tags:extractive summarization, heterogeneous graph, long document summarization, medical paper
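
The heterogeneous graph described in the abstract, with sentence, entity, and section nodes, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function name `build_hetero_graph`, the two edge types, and the lexicon-based substring matching used to find entities are all assumptions made for illustration, and the abstract does not specify how SAPGraph defines its edges or detects medical entities.

```python
from collections import defaultdict


def build_hetero_graph(sections, entity_lexicon):
    """Build a toy heterogeneous graph (hypothetical helper, not from the thesis).

    sections: list of (section_title, [sentence, ...]) pairs.
    entity_lexicon: set of lowercase entity terms (assumed to be given).
    Returns (nodes, edges): nodes maps node type -> list of node labels,
    edges maps edge type -> set of (source_index, target_index) pairs.
    """
    nodes = {"section": [], "sentence": [], "entity": []}
    edges = defaultdict(set)
    entity_ids = {}  # entity term -> index in nodes["entity"]

    for sec_idx, (title, sentences) in enumerate(sections):
        nodes["section"].append(title)
        for sentence in sentences:
            sent_idx = len(nodes["sentence"])
            nodes["sentence"].append(sentence)
            # Explicit writing structure: each sentence belongs to a section.
            edges["sentence-section"].add((sent_idx, sec_idx))

            # Implicit knowledge structure: link sentences to the entities
            # they mention. Substring matching is a crude stand-in for a
            # real medical entity recognizer.
            lowered = sentence.lower()
            for term in sorted(entity_lexicon):
                if term in lowered:
                    if term not in entity_ids:
                        entity_ids[term] = len(nodes["entity"])
                        nodes["entity"].append(term)
                    edges["sentence-entity"].add((sent_idx, entity_ids[term]))

    return nodes, dict(edges)


# Toy document: two sections that mention the same entities.
sections = [
    ("Introduction", ["COVID-19 spreads quickly.", "Vaccines reduce risk."]),
    ("Methods", ["We study COVID-19 vaccines."]),
]
nodes, edges = build_hetero_graph(sections, {"covid-19", "vaccines"})
```

In this toy example, an entity node shared by sentences from different sections gives a two-hop path between them, which is one simple way a graph of this kind can carry cross-section information that a purely sequential model would miss.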