With the rapid development of Internet technology, the amount of information on the network has grown explosively, bringing with it a great deal of redundant information. Automatic text summarization emerged to help people obtain the information they need more quickly and accurately. By generation method, automatic text summarization is divided into extractive summarization and abstractive summarization: extractive summarization directly selects important sentences from the text as the summary according to an algorithm, while abstractive summarization reorganizes the language into a summary after comprehending the ideas and meaning of the text, much as a human would. Compared with Chinese and English, automatic text summarization for Tibetan is still in its infancy, so further exploration and research are necessary. This thesis therefore studies automatic text summarization for Tibetan; its contributions are as follows.

Firstly, to address the lack of a public Tibetan summarization dataset for model training, this thesis cleans 42,119 documents crawled from Tibetan web pages and uses the TextRank algorithm to extract key sentences. After removing redundant data according to the characteristics of the corpus, a Tibetan summarization dataset, TR_data, containing 34,442 texts and corresponding summaries, is constructed for model training and evaluation. The online Tibetan summarization test set TI_SUM is also used to further evaluate the models.

Secondly, to explore how model structure affects generation quality, this thesis builds and trains four abstractive Tibetan summarization models based on LSTM, Transformer, and the pre-trained models BERT and XLM-RoBERTa, respectively. Two evaluation metrics, ROUGE and BERTScore, are used to measure model performance. Experiments show that BERT-abs, the BERT-based abstractive Tibetan summarization model trained in this thesis, performs best.

Thirdly, this thesis further optimizes BERT-abs with an extractive summarization model. Summaries generated by extractive models preserve information more completely than those generated by abstractive models. To compensate for this weakness of the abstractive model, this thesis first uses BERT to construct an extractive summarization model, and then uses the fine-tuned BERT to further train the abstractive summarization model, yielding the model BERT-ext-abs. Comparison experiments are also designed from the perspectives of training data and model structure to further analyze model performance. Experiments show that the ROUGE score of the proposed abstractive Tibetan summarization model BERT-ext-abs improves by about 3 percentage points after combining extractive summarization. Compared with BERT-abs, BERT-ext-abs places lower demands on training data scale and computing equipment, and so has a broader range of application.
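
The dataset construction step relies on TextRank to pick key sentences. The following is a minimal sketch of that idea, not the thesis's implementation: it assumes sentence splitting on the Tibetan shad (།) and syllable-level tokenization on the tsheg (་), since the abstract does not specify the segmentation tools used.

```python
# Minimal TextRank sketch for key-sentence extraction, in the spirit of the
# TR_data construction step. Sentence splitting on the shad (།) and syllable
# tokenization on the tsheg (་) are simplifying assumptions.
import math
import networkx as nx

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in text.split("།") if s.strip()]

def tokenize(sentence: str) -> set[str]:
    # Syllable-level tokens; real Tibetan word segmentation is harder.
    return {t for t in sentence.split("་") if t}

def similarity(s1: set[str], s2: set[str]) -> float:
    # Overlap normalized by log sentence lengths (TextRank's measure).
    if len(s1) < 2 or len(s2) < 2:
        return 0.0
    return len(s1 & s2) / (math.log(len(s1)) + math.log(len(s2)))

def textrank_summary(text: str, k: int = 3) -> list[str]:
    sents = split_sentences(text)
    tokens = [tokenize(s) for s in sents]
    g = nx.Graph()
    g.add_nodes_from(range(len(sents)))
    for i in range(len(sents)):
        for j in range(i + 1, len(sents)):
            w = similarity(tokens[i], tokens[j])
            if w > 0:
                g.add_edge(i, j, weight=w)
    scores = nx.pagerank(g, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sents[i] for i in sorted(top)]  # keep document order
```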
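
Model quality is reported with ROUGE and BERTScore. The snippet below shows one way to compute both; the rouge-score and bert-score packages and the xlm-roberta-base encoder are illustrative choices, not tooling named by the thesis, and rouge-score's tokenizer is English-oriented, so Tibetan text may need pre-tokenization (e.g. by syllable) before scoring.

```python
# Illustrative ROUGE and BERTScore evaluation of generated summaries.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

candidates = ["..."]   # placeholder: model-generated summaries
references = ["..."]   # placeholder: reference summaries (TR_data / TI_SUM)

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
for cand, ref in zip(candidates, references):
    print(scorer.score(ref, cand))

# BERTScore with a multilingual encoder; xlm-roberta-base is one option
# with some coverage of Tibetan.
P, R, F1 = bert_score(candidates, references, model_type="xlm-roberta-base")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```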
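
The BERT-ext-abs idea is a two-stage pipeline: fine-tune BERT as an extractive scorer, then reuse those weights to initialize the encoder of a seq2seq abstractive model. The sketch below illustrates that pattern with Hugging Face transformers; the checkpoint name, the sentence-classification head, and the saved path are assumptions for illustration, since the abstract does not publish the exact architecture.

```python
# Two-stage sketch behind BERT-ext-abs: extractive fine-tuning, then
# encoder reuse in a BERT-initialized encoder-decoder abstractive model.
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          EncoderDecoderModel)

base = "bert-base-multilingual-cased"  # assumed multilingual checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)

# Stage 1: extractive fine-tuning -- classify each sentence as
# summary-worthy (1) or not (0); the training loop is omitted.
ext_model = BertForSequenceClassification.from_pretrained(base, num_labels=2)
# ... fine-tune ext_model on (sentence, label) pairs, then save:
ext_model.save_pretrained("bert-ext")

# Stage 2: abstractive model whose encoder starts from the fine-tuned
# extractive BERT (classification head is dropped on load), with the
# decoder initialized from the same base checkpoint.
abs_model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-ext", base)
abs_model.config.decoder_start_token_id = tokenizer.cls_token_id
abs_model.config.pad_token_id = tokenizer.pad_token_id
# ... train abs_model on (document, summary) pairs from TR_data.
```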