
Research On Key Technologies For Tibetan Abstractive Text Summarization

Posted on: 2024-09-07    Degree: Doctor    Type: Dissertation
Country: China    Candidate: F F Li    Full Text: PDF
GTID: 1525307172472624    Subject: Computer Science and Technology
Abstract/Summary:
Tibetan is the language used for communication and exchange among the Tibetan people, and the Tibetan script is its writing system, with its own word-formation and grammatical rules. With the development of natural language processing (NLP), more and more NLP techniques have been applied to Tibetan information processing; this research promotes communication between ethnic groups and helps people better understand the characteristics of the language. With the rapid development of Internet technology and intelligent terminals, Tibetan-language information is growing quickly, which creates an opportunity to build corpora for the various tasks of Tibetan information processing, but also poses challenges for data storage and management: how to organize these data effectively and quickly distill the critical information is essential.

In recent years, Tibetan information processing has attracted researchers' attention, yet summarization technology has developed slowly, mainly for two reasons. First, downstream NLP tasks depend heavily on basic tasks; electronic resources for Tibetan are limited, and research on tasks such as sentence boundary disambiguation (SBD) and word segmentation has progressed slowly. Most of this work is rule-based, which demands strong linguistic expertise from researchers, and the lack of uniform data and evaluation standards further limits the development of summarization. Second, the structure and grammar of Tibetan characters have their own characteristics, so researchers must explore methods suited to Tibetan rather than directly adopting methods developed for other languages. With the popularity of deep learning, researchers have begun to explore deep-learning-based Tibetan information processing, which improves its efficiency and accuracy and promotes the informatization and digitization of Tibetan.

This study investigates Tibetan text summarization from three aspects: sentence boundary disambiguation, summary corpus construction, and summary generation. It first studies Tibetan SBD based on deep learning and applies it to sentence segmentation; then, based on the SBD results, sentences are segmented with mature word segmentation methods, key sentences are extracted from each document with the TextRank algorithm, and the quality of the resulting dataset is verified with BERT-based extractive summarization; finally, the diverse beam search (DBS) algorithm is introduced, and an abstractive Tibetan summarization method based on BERT and DBS (BERT-DBS) is proposed to reduce redundancy in the generated summaries. The main research work is as follows:

(1) Syllable-level Tibetan SBD based on the attention mechanism. Punctuation marks in Tibetan have special functions and are ambiguous, so the end of a sentence cannot be judged from punctuation alone. Rule-based SBD methods are not standardized, and the limited coverage of the rules means they cannot handle large-scale corpora well. A recurrent neural network (RNN) takes sequence data as input and processes it recursively, and the attention mechanism adaptively learns the importance of different positions in the input sequence, which can effectively improve model performance. This study therefore combines an RNN with the attention mechanism for Tibetan SBD, taking syllables as the training unit. First, vectors for Tibetan syllables are generated with the Word2vec model; second, a window is introduced to set the length of the sequences participating in training, with reference to the rule-based SBD method; finally, generalizability experiments are conducted on English, German, and Thai. The results show that the method overcomes the rule-based SBD method's dependence on word segmentation and part-of-speech (POS) tagging. Compared with the sequence-tagging method, the F1 value improves by 1.55% to 10.53% on the SBD task, and the experiments on other languages demonstrate the generalization ability of the proposed SBD model.

(2) Component-level Tibetan SBD based on Bi-directional Long Short-Term Memory (BiLSTM). Tibetan characters are written both horizontally and vertically, and their construction rules are more complex than those of Chinese and English. To address the instability of the component-level SBD task in previous work, a component-level Tibetan SBD method is proposed that considers the text on both sides of the punctuation mark, i.e., both the text "above" and "below" the mark in the input sequence. In the experiments, the concept of a window is again introduced, and only the components (characters) within the window before and after the punctuation mark participate in training. First, SBD considering only one side of the punctuation mark is compared with SBD considering both sides; second, SBD experiments are carried out on English, German, Turkish, and Romanian. The F1 value of the both-sides method stays around 96%, and the experiments on larger datasets and multiple languages illustrate the model's generalization ability.

(3) Construction of a Tibetan summary dataset based on the TextRank algorithm and verification of its quality. Building on the SBD research above, a long-text summarization dataset for the Tibetan news domain is constructed with reference to the TextRank algorithm, and a short-text summarization dataset is built following the Lead-3 idea. To verify the quality of the datasets, extractive Tibetan summarization is performed on BERT pre-trained language models using the SentencePiece and BPE subword segmentation methods, as well as on the publicly available pre-trained language model TiBERT. The results show that the dataset built with TextRank outperforms the data generated with the Lead-3 idea, and that SentencePiece is more suitable than BPE for subword segmentation in the Tibetan BERT pre-trained language model. Compared with the Transformer baseline, the ROUGE-1 and ROUGE-L of the BERT-classifier, BERT-Transformer, and BERT-RNN models improve by nearly ten percentage points, and ROUGE-2 improves by eight percentage points.

(4) An abstractive Tibetan summarization method based on BERT-DBS. For low-resource languages such as Tibetan, training data are limited, it is harder to extract the semantic information of words during training, and the generated summaries often fail to express the meaning of the original document adequately. The BERT-DBS method uses a BERT pre-trained language model as the encoder and a Transformer as the decoder, and introduces DBS into summary generation: multiple candidate sequences are generated under a diversity factor, and the best candidate is selected as the output. The method is evaluated on the Tibetan summary datasets with BERT pre-trained language models using the two subword segmentation methods (SentencePiece and BPE) and the publicly available TiBERT. The results show that the summaries generated by BERT-DBS capture the meaning of the original text well and effectively reduce the copy ratio of the generated summaries, improving the performance of Tibetan summary generation. Compared with BERT, the ROUGE-1, ROUGE-2, and ROUGE-L of BERT-DBS improve by 6.58%, 3.85%, and 3.77% on long-text summaries, and by 2.97%, 0.63%, and 1.24% on short-text summaries.
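The window mechanism described for the SBD studies — keeping only a fixed number of units (syllables or components) on each side of an ambiguous punctuation mark as model input — can be sketched as follows. This is an illustrative reconstruction, not the dissertation's code; the function name, padding token, and punctuation test are assumptions.

```python
def make_sbd_samples(tokens, is_punct, window=5, pad="<pad>"):
    """For each punctuation token, keep `window` units of context on
    both sides (padded to a fixed length) as one training sample."""
    samples = []
    for i, tok in enumerate(tokens):
        if not is_punct(tok):
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        # Pad so every sample has the same shape for batched training.
        left = [pad] * (window - len(left)) + left
        right = right + [pad] * (window - len(right))
        samples.append((left, right))
    return samples
```

A binary classifier — the RNN with attention in (1), or the BiLSTM over both sides in (2) — would then predict for each sample whether the mark ends a sentence.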
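The TextRank-based key-sentence extraction used to build the long-text dataset can be sketched as below. The overlap similarity and damping constant follow the standard TextRank formulation; the implementation itself is an illustrative assumption (the `+ 1` in the length normalization is a guard against single-word sentences).

```python
import math

def textrank_summary(sentences, k=3, d=0.85, iters=50):
    """Rank sentences with TextRank over a word-overlap similarity
    graph and return the top-k in document order, as one might when
    building pseudo-extractive reference summaries."""
    n = len(sentences)
    toks = [set(s.split()) for s in sentences]
    # Similarity: shared words, normalized by log sentence lengths.
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and toks[i] and toks[j]:
                overlap = len(toks[i] & toks[j])
                denom = math.log(len(toks[i]) + 1) + math.log(len(toks[j]) + 1)
                sim[i][j] = overlap / denom
    # Power iteration of the PageRank-style update.
    scores = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            s = 0.0
            for j in range(n):
                out = sum(sim[j])
                if sim[j][i] and out:
                    s += sim[j][i] * scores[j] / out
            new.append((1 - d) + d * s)
        scores = new
    top = sorted(sorted(range(n), key=lambda i: -scores[i])[:k])
    return [sentences[i] for i in top]
```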
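The diverse beam search used in BERT-DBS partitions the beam into groups and penalizes each group for reusing tokens that earlier groups already chose at the same time step (Hamming diversity, with `lam` playing the role of the diversity factor). A minimal sketch, assuming a model interface `step_logprobs(seq) -> {token: logprob}`; all names and defaults here are illustrative:

```python
from collections import Counter

def diverse_beam_search(step_logprobs, bos, eos, beam=4, groups=2,
                        lam=0.5, max_len=10):
    """Hamming-diversity DBS: split the beam into `groups`; each group
    penalizes tokens already chosen by earlier groups at the same step
    by `lam`, then the best-scoring sequence overall is returned."""
    per_group = beam // groups
    hyps = [[([bos], 0.0)] for _ in range(groups)]
    for _ in range(max_len):
        chosen_now = Counter()  # tokens picked by earlier groups this step
        for g in range(groups):
            cands = []
            for seq, score in hyps[g]:
                if seq[-1] == eos:  # finished hypotheses carry over
                    cands.append((seq, score))
                    continue
                for tok, lp in step_logprobs(seq).items():
                    cands.append((seq + [tok],
                                  score + lp - lam * chosen_now[tok]))
            cands.sort(key=lambda c: -c[1])
            hyps[g] = cands[:per_group]
            for seq, _ in hyps[g]:
                chosen_now[seq[-1]] += 1
        if all(s[-1] == eos for grp in hyps for s, _ in grp):
            break
    return max((c for grp in hyps for c in grp), key=lambda c: c[1])[0]
```

Because later groups are pushed away from earlier groups' choices, the candidate set is more varied than plain beam search, which is what reduces the redundancy of the selected summary.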
Keywords/Search Tags: Tibetan information processing, sentence boundary disambiguation, attention mechanisms, pre-trained language model, text summarization, diverse beam search