Keyword-based Automatic Short Text Generation

Posted on: 2021-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yu
Full Text: PDF
GTID: 2428330605964089
Subject: Computer applications
Abstract/Summary:
In recent years, with the development of deep learning, many tasks in natural language processing have gained a new class of solutions: methods based on deep neural networks. Text generation, one of the most important natural language processing tasks, aims to produce high-quality text, but sentences generated by traditional rule-and-template methods are stiff and far from natural language. With the application of deep neural networks to text generation, sentences have become more flexible and the quality of the generated text has improved greatly. Text generation is currently a hot topic. Although training on different corpora yields different writing styles, the subject of the generated content remains difficult to control. Following the common writing habit of first settling on a theme and then fleshing it out, this thesis designs an automatic keyword-based short text generation system to address this problem. However, because the topic is new, difficulties remain in three aspects: out-of-vocabulary words, keywords that are difficult to express, and text quality assessment.

First, out-of-vocabulary words. Because the quality of the corpus is not high, most keywords appear infrequently in it. With static word vectors, limited computer memory means that low-frequency words are replaced with <UNK>, so their text features are difficult to learn during word vector training. During generation, all information comes from the source keywords. When a keyword has itself been replaced with <UNK>, the short essay generated from such insufficient information clearly fails to meet the requirements of the task.

Second, keywords are difficult to express. While ensuring the quality of the generated essay, this work aims to make the keywords, or words similar to them, appear in the generated essay. However, the positions of the keywords differ from one short text to the next, and the sequence-to-sequence model commonly used in text generation has insufficient alignment ability, which makes it difficult for keywords to be expressed in the short text. In addition, when the training set is constructed, samples whose length exceeds the set threshold must be truncated. The keyword information in the truncated part is lost, which further increases the difficulty of expressing the keywords in the short text.

Third, text quality assessment. Existing text quality evaluation algorithms focus on the difference between generated text and natural language text. However, beyond ensuring the quality of the generated essay, this work also needs to confirm whether the keywords are expressed in it, and this check is part of the text quality assessment for the task.

To address these three aspects, and drawing on the theory of dynamic word vectors and the attention mechanism, this thesis achieves the following results:

(1) ELMo-based word vector representation method. To address the out-of-vocabulary problem, this thesis uses dynamic word vectors, which give a word different vector representations in different contexts, constructs an ELMo-based encoder, and proposes a representation method based on dynamic word vectors. The keywords within the same sample in the corpus are similar to one another to a certain degree. An out-of-vocabulary keyword is given a dynamic vector according to its context, so that its text features can be learned. This makes the source information for the short text more sufficient and the quality of the generated text higher. Simulation experiments show that this method outperforms static word vectors, but it performs poorly when the correlation between keywords is low.

(2) Semantic alignment method based on a dual attention mechanism. To address the difficulty of expressing keywords, this thesis uses the attention mechanism's ability to pay different degrees of attention to different input information, constructs a dual-attention sequence-to-sequence model, and proposes a semantic alignment method based on the dual attention mechanism. The method determines more accurately how far each keyword has already participated in the generated essay, and from that the contribution each keyword can still make to the rest of the essay, which makes the keywords easier to express. Compared with the best-performing MTA-LSTM model on the same training set, this method achieves higher text quality evaluation scores.

(3) Text quality evaluation method based on similarity. To provide a text quality assessment suited to this task, this thesis builds a similarity model between generated essays and keywords by comparing their features, and proposes a similarity-based text quality assessment method. Experiments show that the short texts selected with this method on the validation set are more in line with the requirements of the task.

To sum up, this thesis proposes an ELMo-based word vector representation method, which improves the accuracy of word vector representation so that out-of-vocabulary words can be characterized to a certain extent; a semantic alignment method based on the dual attention mechanism, which increases the likelihood that keywords are expressed in the generated short essays; and a similarity-based text quality evaluation method, which makes it easier to select the better-performing short essays on the validation set. Overall, the quality of keyword-based short essays is improved and the intended goal is achieved.
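
The out-of-vocabulary problem described in the abstract can be illustrated with a minimal sketch. It assumes a frequency-capped vocabulary, which is the usual reason low-frequency words become <UNK>; the function names, the toy corpus, and the vocabulary size are hypothetical and are not taken from the thesis.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(tokenized_texts, max_size):
    """Keep the max_size most frequent tokens (hypothetical helper)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    return {tok for tok, _ in counts.most_common(max_size)}

def apply_vocab(tokens, vocab):
    """Replace every token outside the vocabulary with <UNK>."""
    return [tok if tok in vocab else UNK for tok in tokens]

# Toy corpus, invented for illustration only.
corpus = [
    ["spring", "rain", "falls", "on", "the", "river"],
    ["the", "river", "flows", "in", "spring"],
    ["the", "rain", "stops"],
]

# A small cap stands in for the memory limit mentioned in the abstract.
vocab = build_vocab(corpus, max_size=4)

# "willow" never occurs in the corpus, so it collapses to <UNK>
# and contributes no keyword-specific information to the generator.
keywords = ["spring", "willow"]
encoded_keywords = apply_vocab(keywords, vocab)
print(encoded_keywords)
```

Once a keyword is mapped to <UNK>, a static embedding table has a single shared vector for it, which is the information loss that a context-dependent representation is meant to reduce.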
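
The idea of scoring a generated essay by whether its keywords, or words similar to them, are expressed can be sketched as follows. This is only an assumed stand-in, not the similarity model built in the thesis: it uses cosine similarity over hand-made two-dimensional vectors, and the function names, the vectors, and the 0.8 threshold are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def keyword_coverage(keywords, tokens, vectors, threshold=0.8):
    """Fraction of keywords that appear in tokens, or whose best cosine
    similarity to some token reaches the (assumed) threshold."""
    if not keywords:
        return 0.0
    covered = 0
    for kw in keywords:
        if kw in tokens:
            covered += 1
            continue
        kw_vec = vectors.get(kw)
        if kw_vec is None:
            continue  # no vector available, so similarity cannot be checked
        best = max(
            (cosine(kw_vec, vectors[t]) for t in tokens if t in vectors),
            default=0.0,
        )
        if best >= threshold:
            covered += 1
    return covered / len(keywords)

# Toy vectors, invented for illustration only.
toy_vectors = {
    "rain": [1.0, 0.0],
    "drizzle": [0.9, 0.1],
    "river": [0.0, 1.0],
    "mountain": [-1.0, 0.0],
}
candidates = [
    ["drizzle", "over", "the", "river"],
    ["the", "mountain", "is", "quiet"],
]
keywords = ["rain", "river"]

# Score every candidate and keep the one that best expresses the keywords.
scores = [keyword_coverage(keywords, c, toy_vectors) for c in candidates]
best_candidate = candidates[scores.index(max(scores))]
print(scores, best_candidate)
```

In practice such a coverage score would be combined with a fluency measure, since the abstract treats keyword expression as one part of text quality assessment and not the whole of it.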
Keywords/Search Tags: NLP, Dynamic word vector, Attention mechanism, Keywords