With the rapid development of Internet technology, the amount of information on the network has grown explosively, bringing with it a great deal of redundant information. Automatic text summarization emerged to help people obtain the information they need more quickly and accurately. By generation method, automatic text summarization is divided into extractive summarization and abstractive summarization: extractive summarization directly selects important sentences from the text as the summary according to an algorithm, while abstractive summarization reorganizes the language into a summary after comprehending the ideas and meaning of the text, much as a human would. Compared with Chinese and English, automatic text summarization for Tibetan is still in its infancy, so further exploration and research are necessary. This thesis therefore studies automatic text summarization for Tibetan; its contributions are as follows.

Firstly, to address the lack of a public Tibetan summarization dataset for model training, this thesis cleans 42,119 documents crawled from Tibetan web pages and uses the TextRank algorithm to extract key sentences. After removing redundant data according to the characteristics of the corpus, a Tibetan summarization dataset, TR_data, containing 34,442 texts and corresponding summaries, is constructed for model training and evaluation. The online Tibetan summarization test set TI_SUM is also used to further evaluate the models.

Secondly, to explore how model structure affects generation quality, this thesis builds and trains four abstractive Tibetan summarization models based on LSTM, Transformer, and the pre-trained models BERT and XLM-RoBERTa, respectively. Two evaluation metrics, ROUGE and BERTScore, are used to measure model performance. Experiments show that BERT-abs, the BERT-based abstractive Tibetan summarization model trained in this thesis, performs best.

Thirdly, this thesis further optimizes BERT-abs with an extractive summarization model. Summaries generated by extractive models preserve information more completely than those generated by abstractive models. To compensate for this weakness of the abstractive model, this thesis first uses BERT to construct an extractive summarization model, and then uses the fine-tuned BERT to further train the abstractive summarization model, yielding the model BERT-ext-abs. Comparison experiments are also designed from the perspectives of training data and model structure to further analyze model performance. Experiments show that the ROUGE score of the proposed abstractive Tibetan summarization model BERT-ext-abs improves by about 3 percentage points after combining extractive summarization. Compared with BERT-abs, BERT-ext-abs places lower demands on training data scale and computing equipment, and so has a broader range of application.
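
The dataset construction step relies on TextRank to pick key sentences. The following is a minimal sketch of that idea, not the thesis's implementation: it assumes sentence splitting on the Tibetan shad (།) and syllable-level tokenization on the tsheg (་), since the abstract does not specify the segmentation tools used.

```python
# Minimal TextRank sketch for key-sentence extraction, in the spirit of the
# TR_data construction step. Sentence splitting on the shad (།) and syllable
# tokenization on the tsheg (་) are simplifying assumptions.
import math
import networkx as nx

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in text.split("།") if s.strip()]

def tokenize(sentence: str) -> set[str]:
    # Syllable-level tokens; real Tibetan word segmentation is harder.
    return {t for t in sentence.split("་") if t}

def similarity(s1: set[str], s2: set[str]) -> float:
    # Overlap normalized by log sentence lengths (TextRank's measure).
    if len(s1) < 2 or len(s2) < 2:
        return 0.0
    return len(s1 & s2) / (math.log(len(s1)) + math.log(len(s2)))

def textrank_summary(text: str, k: int = 3) -> list[str]:
    sents = split_sentences(text)
    tokens = [tokenize(s) for s in sents]
    g = nx.Graph()
    g.add_nodes_from(range(len(sents)))
    for i in range(len(sents)):
        for j in range(i + 1, len(sents)):
            w = similarity(tokens[i], tokens[j])
            if w > 0:
                g.add_edge(i, j, weight=w)
    scores = nx.pagerank(g, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sents[i] for i in sorted(top)]  # keep document order
```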
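
Model quality is reported with ROUGE and BERTScore. The snippet below shows one way to compute both; the rouge-score and bert-score packages and the xlm-roberta-base encoder are illustrative choices, not tooling named by the thesis, and rouge-score's tokenizer is English-oriented, so Tibetan text may need pre-tokenization (e.g. by syllable) before scoring.

```python
# Illustrative ROUGE and BERTScore evaluation of generated summaries.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

candidates = ["..."]   # placeholder: model-generated summaries
references = ["..."]   # placeholder: reference summaries (TR_data / TI_SUM)

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
for cand, ref in zip(candidates, references):
    print(scorer.score(ref, cand))

# BERTScore with a multilingual encoder; xlm-roberta-base is one option
# with some coverage of Tibetan.
P, R, F1 = bert_score(candidates, references, model_type="xlm-roberta-base")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```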
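
The BERT-ext-abs idea is a two-stage pipeline: fine-tune BERT as an extractive scorer, then reuse those weights to initialize the encoder of a seq2seq abstractive model. The sketch below illustrates that pattern with Hugging Face transformers; the checkpoint name, the sentence-classification head, and the saved path are assumptions for illustration, since the abstract does not publish the exact architecture.

```python
# Two-stage sketch behind BERT-ext-abs: extractive fine-tuning, then
# encoder reuse in a BERT-initialized encoder-decoder abstractive model.
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          EncoderDecoderModel)

base = "bert-base-multilingual-cased"  # assumed multilingual checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)

# Stage 1: extractive fine-tuning -- classify each sentence as
# summary-worthy (1) or not (0); the training loop is omitted.
ext_model = BertForSequenceClassification.from_pretrained(base, num_labels=2)
# ... fine-tune ext_model on (sentence, label) pairs, then save:
ext_model.save_pretrained("bert-ext")

# Stage 2: abstractive model whose encoder starts from the fine-tuned
# extractive BERT (classification head is dropped on load), with the
# decoder initialized from the same base checkpoint.
abs_model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-ext", base)
abs_model.config.decoder_start_token_id = tokenizer.cls_token_id
abs_model.config.pad_token_id = tokenizer.pad_token_id
# ... train abs_model on (document, summary) pairs from TR_data.
```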