Question generation is an important research direction in natural language processing. It aims to automatically generate a coherent and fluent question from a given paragraph and target answer, and it is widely used in news writing, chatbots, and other applications. Early question generation methods were mainly based on rules and templates, which offered poor diversity and transferability. Later work mainly adopted sequence-to-sequence models with attention and copy mechanisms. With the emergence of large-scale pre-trained language models, models represented by BERT have achieved excellent results on many downstream natural language processing tasks, and pre-trained language models have gradually been applied to question generation. Research on question generation in Chinese and English has made great progress, but for minority languages such as Tibetan, related research is still in its infancy. The main reasons are as follows: (1) Although existing multilingual pre-trained models can be applied to Tibetan through cross-language transfer, the results are not satisfactory. (2) The grammatical rules of Tibetan are complex, and the interrogative words in questions are ambiguous and their positions are not fixed; on a small dataset, it is difficult for a model to learn these complex grammatical rules. (3) Tibetan question generation models suffer from missing keywords, which makes the generated questions unanswerable. To address these problems, this paper studies Tibetan question generation, and the main work is as follows:

(1) Construction of a Tibetan Pre-trained Language Model
This paper collects and constructs a 3.56 GB dataset from 21 websites, including Yunzang Network and Tibet People's Network, and trains a Tibetan pre-trained language model named TiBERT. To better represent the semantic information of Tibetan, this paper uses the SentencePiece model
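The idea behind building such a subword vocabulary can be illustrated with a minimal sketch. The thesis uses SentencePiece; the toy below instead learns byte-pair-style merges from word frequencies (a simpler algorithm than SentencePiece's unigram model, used here only to show how frequent character sequences become vocabulary units). The corpus and merge count are illustrative assumptions, not the thesis's actual data.

```python
from collections import Counter

def bpe_merges(corpus_words, num_merges):
    """Learn byte-pair-style merges from a word-frequency dict.
    Each word is split into characters; at every step the most
    frequent adjacent symbol pair is merged into one subword."""
    vocab = Counter({tuple(w): c for w, c in corpus_words.items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus (Latin placeholders standing in for Tibetan syllables).
corpus = {"lower": 5, "lowest": 2, "newer": 6, "wider": 3}
merges, vocab = bpe_merges(corpus, 10)
```

The same principle lets a fixed-size subword vocabulary cover nearly all of a large corpus: rare words fall back to smaller, more frequent pieces instead of an out-of-vocabulary token.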
to segment Tibetan sentences and builds a vocabulary that covers 99.95% of the words in the corpus. Finally, the trained TiBERT is applied to two downstream tasks: Tibetan text classification and question generation.

(2) Tibetan Question Generation Based on Tibetan Grammatical Knowledge and Fine-grained Classification
To address the problem that existing question generation models produce inaccurate interrogative words, this paper constructs a model consisting of a question type classifier and a question generator. The classifier performs fine-grained classification based on the paragraph, the answer, and Tibetan grammatical knowledge to predict the correct question type; the predicted type is then used as an additional input to the question generator, fed in together with the paragraph and the answer.

(3) Tibetan Question Generation Based on Key Sentence Identification and Structured Knowledge
Although the question generation model based on fine-grained classification improves the accuracy of interrogative words, the generated questions still suffer from missing keywords, which makes them unanswerable. To solve this problem, this paper proposes a question generation method based on key sentence identification and structured knowledge. Key sentence identification captures keyword information near the answer, while structured knowledge alleviates the impact of labeling errors and captures keyword information far from the answer. The structured knowledge consists of answer-related triples extracted in advance from the sentences.
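How the predicted question type and the extracted triples might be combined into a single generator input can be sketched as below. All bracketed control tokens, the separator convention, and the linearization of triples are illustrative assumptions about the input layout, not the thesis's actual format.

```python
def build_generator_input(paragraph, answer, q_type, triples=()):
    """Linearize the generator input (assumed layout):
    - the predicted question type becomes a control token prefix,
    - paragraph and answer are joined with BERT-style [SEP] tokens,
    - each (subject, relation, object) triple is appended so that
      keywords far from the answer still reach the generator."""
    parts = [f"[{q_type.upper()}]", paragraph, "[SEP]", answer]
    for subj, rel, obj in triples:
        parts += ["[SEP]", f"{subj} {rel} {obj}"]
    return " ".join(parts)

# Hypothetical example (English placeholders for Tibetan text).
inp = build_generator_input(
    "Lhasa is the capital of the Tibet Autonomous Region.",
    "Lhasa",
    "where",
    [("Lhasa", "capital_of", "Tibet Autonomous Region")],
)
```

Conditioning on an explicit type token lets the decoder commit to the right interrogative word early, while the appended triples supply answer-related keywords the paragraph context alone might not surface.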