Research On The Generation Of Tibetan News Abstracts Based On A Unified Model

Posted on: 2021-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: W Li
GTID: 2438330602998431
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet, the amount of information on the web has grown explosively, which makes it difficult to extract valuable information efficiently. Text summarization technology has emerged in response: it helps readers grasp the main idea of a lengthy news article and filters out redundant information, which speeds up news browsing. Text summarization is a research hotspot in natural language processing and has attracted growing attention from researchers. According to the implementation method, it falls into two categories: extractive and abstractive. Extractive approaches build a summary by selecting sentences from the original text and putting them together. Abstractive approaches generate a summary from scratch, rephrasing the source with novel words and phrases rather than copying from it.

Research on text summarization has made remarkable progress for Chinese and English. For Tibetan, however, both summary generation and evaluation still lag behind. Existing work mainly applies unsupervised methods to a small, manually collected corpus; no large-scale corpus is available, and there is no standard evaluation method. Moreover, the sequence-to-sequence summarization model has not yet been applied to Tibetan, although it performs well in both Chinese and English. This thesis studies Tibetan news text summarization. Its main contents and innovations are as follows:

1. To address the lack of a large-scale training corpus, of a standard evaluation method, and of reference results for Tibetan, 50,000 Tibetan news articles were collected as a training corpus, and the results of K-means clustering together with the headlines were taken as reference summaries. Traditional summarization methods and abstractive summarization methods were applied to Tibetan and evaluated with ROUGE, the standard metric for text summarization, to provide a reference baseline.

2. Tibetan news texts are long, which leads to vanishing and exploding gradients during training. To address this, a joint model combining the extractive and abstractive methods is used. The extractive method first selects the sentences that best represent the original article, which removes redundant information and shortens the text; the abstractive method then generates the summary from the selected sentences. Experimental results show that the ROUGE-1 score improves by about 2% over that of the traditional summarization method.

3. To address the lack of a labeled training corpus for the first stage of the joint model, the TextRank algorithm was used to label the training corpus for extraction, and the extractive neural network model was trained on it. In the second stage, a pointer mechanism and a coverage mechanism are introduced to reduce semantic repetition in the generated summary.
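
The TextRank labeling step in the first stage of the joint model can be illustrated with a minimal sketch. It is not the thesis's implementation: the word-overlap similarity, the damping factor of 0.85, the iteration count, and the `top_k` cutoff are all assumptions, and the function names are hypothetical. Sentences are assumed to be already tokenized into lists of tokens, since the abstract does not say how Tibetan text is segmented.

```python
import math


def sentence_similarity(a, b):
    """Word-overlap similarity between two tokenized sentences (assumed measure)."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) == 0 would make the normalizer degenerate
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over the similarity graph."""
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] = sentence_similarity(sentences[i], sentences[j])
    out_sum = [sum(row) for row in weights]  # total edge weight leaving each node
    scores = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if out_sum[j] > 0:
                    rank += weights[j][i] / out_sum[j] * scores[j]
            updated.append((1 - damping) + damping * rank)
        scores = updated
    return scores


def label_sentences(sentences, top_k=2):
    """Return a 0/1 label per sentence: 1 for the top_k highest-scoring ones."""
    scores = textrank_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:top_k])
    return [1 if i in selected else 0 for i in range(len(sentences))]


if __name__ == "__main__":
    # Toy, pre-tokenized "article"; real input would be segmented Tibetan sentences.
    article = [
        ["flood", "hits", "the", "valley"],
        ["the", "flood", "damaged", "valley", "roads"],
        ["officials", "met", "on", "monday"],
    ]
    print(label_sentences(article, top_k=2))
```

Labels produced this way could serve as pseudo-targets for a supervised extractive model, which is the role the abstract assigns to TextRank; how many sentences to mark per article is a tuning choice the abstract does not state.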
Keywords/Search Tags: Text summarization, ROUGE, TextRank, Pointer mechanism, Seq2Seq, Attention mechanism