Research And Implementation Of Long Text Semantic Similarity Algorithm

Posted on: 2021-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
GTID: 2428330605476055
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of information technology and the spread of mobile terminals have accelerated the flow of information, and the growing volume of text data has become an important source of information. Application scenarios for text semantic similarity are increasingly common. For short texts, information retrieval returns the answers most relevant to the query a user enters, and intelligent customer service systems return matching sentences from a back-end database based on the user's question. Long texts such as paragraphs have many applications in news classification, plagiarism detection, and automatic essay scoring, and are therefore of research value.

Advances in natural language processing provide methods for calculating text similarity, and deep learning models have achieved good results on short-text similarity tasks. However, existing methods are not ideal for long texts, because paragraphs are more complex in composition than sentences, which makes their semantic similarity harder to calculate. After reviewing existing methods, this thesis takes paragraphs as an example and calculates paragraph semantic similarity from two angles: paragraph semantic vector representation and paragraph summarization.

A paragraph is composed of multiple sentences, and each sentence contains multiple words, so the semantic representation of a paragraph can be derived from the semantic representations of its sentences. On this basis, the thesis proposes a method that builds the representation hierarchically to obtain paragraph vectors. It consists of four stages: word encoding, word attention, sentence encoding, and sentence attention. Encoding uses BiLSTM, attention uses a multi-head attention mechanism, and a CNN then further extracts semantic features. Once the vectors of a paragraph pair are obtained, the similarity score is computed from the cosine distance between them. Compared with long short-term memory networks, the proposed model has the following advantages: (1) Multi-head attention extracts features from multiple dimensions of the sequence data and aggregates them into the final representation. It calculates the semantic relevance between any two words in a sentence, which the traditional attention mechanism cannot obtain. (2) Because convolutional neural networks are effective at local feature extraction, a CNN is used to further extract local features after sentence encoding.

The high dimensionality of paragraphs and the wide span of their context make the calculation harder; reducing the paragraph dimension reduces that difficulty. The thesis therefore proposes a paragraph similarity algorithm based on summary generation. The idea is to summarize each paragraph automatically, on the assumption that the summary represents the semantics of the paragraph, so that paragraph similarity is reduced to the similarity of sentence pairs. Existing extractive and generative (abstractive) summarization methods are studied, and a hierarchically structured generative summarization model is proposed. Within an encoder-decoder framework, words are encoded hierarchically at the encoder, the resulting sentence vectors are fed into a BiLSTM for selection, and the newly generated sentence-level vectors are passed to the decoder as the intermediate semantic state. The decoder uses a multi-layer LSTM combined with attention. The multi-layer recurrent network improves the accuracy of the generated summaries to some extent and improves the generalization ability of the model.
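
The hierarchical idea described in the abstract (word vectors pooled into sentence vectors, sentence vectors pooled into a paragraph vector, paragraph vectors compared with a cosine measure) can be illustrated with a minimal sketch. This is not the thesis model: single-head dot-product attention pooling stands in for the BiLSTM encoders, the multi-head attention, and the CNN, and the names attention_pool, paragraph_vector, word_query, and sentence_query are hypothetical. In a trained model the query vectors and word vectors would be learned; here they are supplied by hand.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weighted average of `vectors`; weights are softmax(dot(vector, query)).

    Simplified stand-in for the attention layers in the abstract (assumption).
    """
    weights = softmax([dot(v, query) for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def paragraph_vector(paragraph, word_query, sentence_query):
    """Build a paragraph vector in two levels.

    `paragraph` is a list of sentences; each sentence is a list of word vectors.
    Level 1 pools words into sentence vectors, level 2 pools sentences.
    """
    sentence_vectors = [attention_pool(sentence, word_query) for sentence in paragraph]
    return attention_pool(sentence_vectors, sentence_query)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


# Toy 2-dimensional word vectors (made up for illustration only).
paragraph_a = [[[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1]]]
paragraph_b = [[[0.0, 1.0], [0.1, 0.9]], [[0.2, 0.8]]]
word_query = [0.5, 0.5]
sentence_query = [0.5, 0.5]

vec_a = paragraph_vector(paragraph_a, word_query, sentence_query)
vec_b = paragraph_vector(paragraph_b, word_query, sentence_query)
score = cosine_similarity(vec_a, vec_b)
```

The cosine distance mentioned in the abstract is 1 minus this cosine similarity, so either quantity can serve as the score once the paragraph vectors exist. The two-level pooling is what keeps the cost tied to sentence length and sentence count separately rather than to the full paragraph length at once.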
Keywords: semantic similarity, multi-head attention, hierarchical structure, generative summary