| As information globalization deepens, text summarization technology is no longer applied only to high-resource languages such as English and Mandarin Chinese. Building a high-performance text summarization system in a low-resource setting has become both a research hotspot and a difficult problem. Tibetan is one of the minority languages of China, and it is also spoken by some people in Bhutan, India, Nepal and Pakistan; in total, about 8 million people speak Tibetan. The development of Tibetan informatization is therefore very important. However, Tibetan summarization faces three problems. First, Tibetan informatization started relatively late, and there is currently no effective Tibetan text summarization system. Second, the wave of intelligent applications triggered by deep learning has swept the world, and for computers to understand a task accurately they often need a large amount of training data, but Tibetan, as a low-resource language, currently lacks large-scale datasets. Third, as online information grows ever more abundant, people are no longer satisfied with searching only within a single language, so the cross-language capability of summarization systems has attracted increasing attention, yet research on Tibetan cross-language summarization is still in its infancy. Because these are the problems currently facing Tibetan summarization systems, studying them is of great significance, and this paper conducts research on such systems. The main innovations of this paper are: (1) We construct 20,000 Tibetan news-headline summaries as a test set. In view of the current lack of public evaluation datasets for Tibetan text summarization, we manually constructed a dataset of 1,000 Tibetan text summaries and keyword information corresponding to more than 3,500 articles to assist in evaluating Tibetan text summarization systems. The quality of the final summaries is ensured by cleaning and scoring the articles. The
experimental results show that the Tibetan text dataset we constructed describes the key information of each article accurately and without repetition, and can be used to evaluate Tibetan text summarization systems. We also constructed a training set of 20,000 Tibetan news headlines. (2) We propose a Tibetan multi-document summarization model based on an improved TextRank. To address the weak performance of traditional k-means in text clustering, we adopt a two-stage clustering strategy and use spectral clustering to group more closely related topics. Then, to address the problem that traditional TextRank treats all sentence nodes equally, we change the random jump probability of each sentence node by fusing topic features, so that sentences more relevant to the topic have a higher probability of being reached by a jump. The experimental results show that our model achieves 32.4% on ROUGE-L, which is 17.2% higher than the traditional baseline model. (3) In view of the current lack of research on Tibetan cross-language summarization, we propose an end-to-end Tibetan-Chinese cross-language summarization model, which alleviates the accumulation of propagated errors found in traditional pipeline-based cross-language summarization models. To address the lack of Tibetan-Chinese cross-language summarization datasets, we use a back-translation strategy to ensure the quality of the datasets. Through the inductive transfer mechanism of multi-task learning, the target task is decomposed into a monolingual summarization task and a multilingual summarization task to improve the generalization performance of the model. |
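
The topic-biased random jump described in innovation (2) can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so the code below assumes a personalized-PageRank-style formulation in which the jump distribution is proportional to a per-sentence topic relevance score. The function name `topic_biased_textrank`, its parameters, and the toy similarity matrix are hypothetical and are not taken from the paper.

```python
from typing import List


def topic_biased_textrank(
    similarity: List[List[float]],
    topic_relevance: List[float],
    damping: float = 0.85,
    iterations: int = 100,
    tol: float = 1e-8,
) -> List[float]:
    """Rank sentences on a similarity graph with a topic-biased jump.

    Assumption (not from the paper): the random jump lands on sentence j
    with probability proportional to topic_relevance[j], instead of the
    uniform 1/n used by standard TextRank.
    """
    n = len(similarity)
    total = sum(topic_relevance)
    # Normalise relevance into a jump distribution; fall back to uniform.
    jump = [r / total for r in topic_relevance] if total > 0 else [1.0 / n] * n
    # Outgoing edge weight of each sentence node, ignoring self-loops.
    out_weight = [
        sum(similarity[i][j] for j in range(n) if j != i) for i in range(n)
    ]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Score held by nodes without outgoing edges is redistributed
        # through the jump distribution so the scores keep summing to 1.
        dangling = sum(scores[i] for i in range(n) if out_weight[i] == 0)
        new_scores = []
        for j in range(n):
            walk = sum(
                scores[i] * similarity[i][j] / out_weight[i]
                for i in range(n)
                if i != j and out_weight[i] > 0
            )
            new_scores.append(
                (1 - damping) * jump[j] + damping * (walk + dangling * jump[j])
            )
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy example: three mutually similar sentences, the first most on-topic.
graph = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
scores = topic_biased_textrank(graph, topic_relevance=[0.8, 0.1, 0.1])
```

In this toy graph all sentences are equally connected, so any difference in the final scores comes only from the jump distribution: the first sentence ranks highest, and with uniform relevance the sketch reduces to standard TextRank. How the topic features themselves are computed (for example from the two-stage clustering) is left outside this sketch.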