Text-based question generation aims to select appropriate content from a given text as the answer and to generate reasonable, fluent questions about it. Question generation involves cognitive processes such as semantic understanding of text and knowledge extraction, which makes it a challenging problem in artificial intelligence. The task can serve downstream tasks that use question answering samples, such as dialogue systems, question answering systems, and machine reading comprehension: the generated data augment the existing labeled data and help these tasks improve on their current performance, which gives the task both practical significance and theoretical value. Current research on diverse question generation focuses on increasing the diversity of question expression: given a labeled answer, it captures several pieces of information related to that answer and generates multiple questions with different wording for the same answer, but it neglects the diversity of question content. This thesis proposes a diverse question generation framework based on neural methods, comprising a content selector, answer selection, question generation, and a generated-sample filter, which achieves diversity of question expression and also enriches the diversity of question content. The innovations of this thesis are as follows:

(1) A content selector combining rule-based and summarization strategies is proposed to increase the diversity of question content and question form. The first strategy is rule-fused content extraction, which evaluates how representative a sentence is by its degree of correlation with the text, evaluates its distinctiveness by the difference in similarity between sentences, and extracts content by combining the two evaluation indicators. The second strategy is summarization-based content generation, which uses abstractive summarization to understand the original text and compress its content, and generates new content text through language 
reorganization. Compared with existing methods, which only obtain relevant information from the original text, this thesis adds text content in various forms as the question context, taking into account both the richness of question content and the diversity of question form.

(2) Two types of answer extraction model, entity-level and sentence-level, are proposed to increase the diversity of what the questions focus on. The answer extraction models obtain several types of candidate answers from the content text and adopt an information-based sentence selection method to improve the value of the extracted answers as question targets. Compared with existing work, this method not only attends to more potential information but also filters out low-value sentences that carry little information in the context. Extracting candidate answers in different forms improves the flexibility of question content and the diversity of question form.

(3) A generated-sample filter based on the dual task is proposed to improve the quality of the generated data. Question answering, the dual task of question generation, is used as the filter: a question answering system based on a pre-trained model filters the samples generated by the model in this thesis. Given the question and context of a generated sample, the filter evaluates the consistency between the inferred answer and the sample answer and judges the quality of the sample accordingly. Compared with existing work, this method not only evaluates the quality of the generated data but also adds the ability to correct answers and improves the consistency of the generated data, so it is better suited to practical needs.

Finally, the performance of the proposed model is verified on Chinese data. First, sample datasets are generated with different methods, intelligent question answering is chosen as the downstream task, and the generated samples are used as data augmentation to improve the performance of the downstream task model. The results show that the model works better than 
related work, which illustrates the positive effect of diversity on data augmentation. This also reflects the practical value of the work in this thesis. Second, a direct comparison of the generated data shows that this work achieves higher question diversity and generation validity than related work. Furthermore, the influence of diverse data sources on the performance of the model and of downstream tasks is analyzed in detail. The results show that generating questions from data of different sources yields a wider range of diverse information, which further improves the diversity of the questions and the performance of downstream tasks.
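
As an illustration of the content extraction strategy in innovation (1), the following is a minimal sketch that combines a representativeness score with a difference score when choosing sentences. The abstract does not give the actual formulas, so everything here is an assumption: bag-of-words cosine similarity stands in for the text correlation measure, a greedy selection loop stands in for the extraction rule, and the names `select_content`, `k`, and `lam` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a stand-in for the thesis's text representation."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_content(sentences, k=2, lam=0.5):
    """Greedily pick up to k sentences by combining two scores.

    representativeness: similarity of the sentence to the whole text
    difference:         1 - max similarity to any sentence already picked
    lam is a hypothetical mixing weight, not a value from the thesis.
    """
    doc_vec = bag_of_words(" ".join(sentences))
    vecs = [bag_of_words(s) for s in sentences]
    picked = []
    while len(picked) < min(k, len(sentences)):
        best_i, best_score = None, float("-inf")
        for i, vec in enumerate(vecs):
            if i in picked:
                continue
            rep = cosine(vec, doc_vec)
            diff = 1.0 - max((cosine(vec, vecs[j]) for j in picked), default=0.0)
            score = lam * rep + (1.0 - lam) * diff
            if score > best_score:
                best_i, best_score = i, score
        picked.append(best_i)
    return [sentences[i] for i in picked]
```

With `lam` close to 1 the selection favors sentences that resemble the text as a whole; with `lam` close to 0 it favors sentences unlike those already chosen, so a near-duplicate of an already selected sentence is unlikely to be picked again.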