Research On Automatic Text Abstract System Based On Chinese Long Text

Posted on: 2021-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Li
GTID: 2518306503974269
Subject: Integrated circuits and projects
Abstract/Summary:
Automatic text summarization is an important research direction in the field of artificial intelligence. Depending on how the summary is produced, automatic text summarization can be divided into extractive summarization and generative (abstractive) summarization. Because generative summaries more closely resemble human-written summaries, they have become the mainstream of research in recent years. However, generative summarization suffers from serious information loss when applied to long Chinese texts. This thesis proposes a new model, SSM (Super Segmentation Module), to address two problems. First, the word2vec word embedding model commonly used in previous automatic text summarization methods causes errors on polysemous words in Chinese text. In this thesis, BERT is used instead of word2vec to generate sentence vectors. The deeper network of BERT lets the sentence vectors it generates carry more information, which improves performance on long text. Second, for long texts that cover multiple topics, generative summarization models tend to miss some of the topics. A topic segmentation module is therefore added to the model, with sentence correlation calculated using an improved Jaccard algorithm together with word2vec, which effectively addresses the problem of missing topic words. ROUGE, which counts the overlapping word sequences (n-grams) between the machine-generated summary and the reference summary, is used as the evaluation metric, and long Chinese texts are used as the data set. On long-text data sets of more than 5000 words, the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L increase by 40%, 60%, and 63%, respectively. These results verify that improving the word embedding layer and adding the topic segmentation module effectively improve the model's performance in automatic summarization of long texts.
Keywords/Search Tags:automatic text summarization, sentence similarity algorithm, topic segmentation, word embedding
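
The two overlap measures named in the abstract can be illustrated with a minimal Python sketch. It shows the standard ROUGE-N F1 score (clipped n-gram overlap between a candidate and a reference summary) and the plain Jaccard similarity between two token sets. The thesis's improved Jaccard variant is not described in the abstract, so only the textbook form is shown here. The function names `ngrams`, `rouge_n_f1`, and `jaccard` are hypothetical, and the inputs are assumed to be already tokenized (for Chinese, for example, by a word segmenter or character by character).

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """Standard ROUGE-N F1 from clipped n-gram overlap (illustrative only)."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def jaccard(tokens_a, tokens_b):
    """Plain Jaccard similarity; NOT the thesis's improved variant."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


candidate = ["the", "cat", "sat"]
reference = ["the", "cat", "ran", "away"]
rouge1 = rouge_n_f1(candidate, reference, n=1)
rouge2 = rouge_n_f1(candidate, reference, n=2)
similarity = jaccard(candidate, reference)
```

In this toy example the candidate shares two unigrams and one bigram with the reference, so ROUGE-1 F1 is 4/7, ROUGE-2 F1 is 0.4, and the Jaccard similarity is 2/5.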