Research On Semantic Similarity Of Short Text Based On Time Warping Distance

Posted on: 2021-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
Full Text: PDF
GTID: 2428330629986199
Subject: Computer technology
Abstract/Summary:
With the spread of mobile smart devices and social networks, large volumes of short text such as news summaries, microblog posts, and product reviews have emerged. How to mine commercially valuable information from this massive short-text data has become a topic of concern for many researchers in Chinese natural language processing. Text similarity is a core task in applied artificial intelligence, underpinning machine translation, sentiment analysis, and information retrieval. This thesis studies how to accurately capture the semantic features of ambiguous words in Chinese short texts and how to combine them effectively with the overall word-order structure of a text when computing similarity. The main work is as follows:

(1) Traditional methods based on character statistics capture only shallow statistics of the words in a text, while methods based on semantic dictionaries and syntactic dependency analysis suffer from strongly subjective feature representation and a limited knowledge base. To address these problems, this thesis proposes a semantic similarity calculation method based on Word2vec, an improved DTW algorithm, and the Hungarian algorithm. Word2vec is trained on a large-scale text corpus to obtain word vectors that objectively express word feature information. Each word vector is treated as a point in space, so a sequence of word vectors becomes a sequence of points. The alignment distance between the curves formed by connecting these point sequences is computed with the Hungarian algorithm and a DTW algorithm optimized by weighted common subsequence length. The similarity between short texts then follows from the principle that the smaller the alignment distance, the higher the similarity.

(2) Static word vectors cannot use the current context to distinguish the feature information expressed by ambiguous words. To address this, this thesis proposes a semantic similarity calculation method based on BERT and Time Warping Distance. Relying on BERT's masked training mechanism and its self-attention semantic enhancement mechanism, the method extracts the semantic features of a short text at the whole-text level. The extracted feature vectors of the whole short text are converted into a sequence of points in space, and the Time Warping Distance between the curves formed by connecting the point sequences is computed with the CTW algorithm. Similarity again follows from the principle that the smaller the Time Warping Distance, the higher the similarity.

The experimental results show that the method combining Word2vec with the improved DTW algorithm and the Hungarian algorithm can classify semantically similar short texts according to the degree of word-order disorder, and that it computes reasonable and effective similarity scores for short texts in general scenarios. The method combining BERT with Time Warping Distance captures the feature information of ambiguous words well and computes the similarity between short texts effectively. Compared with the other methods, it distinguishes short texts containing lexical ambiguity more accurately.
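The core idea in the abstract, treating each text as a sequence of points and scoring two texts by their warping distance, can be illustrated with a minimal sketch. It uses the classic, unmodified DTW recurrence, not the thesis's improved variant: the weighted common-subsequence optimization, the Hungarian step, and CTW are not reproduced here. The 2-D toy vectors stand in for real Word2vec or BERT embeddings. The function names and the mapping `1 / (1 + distance)` are assumptions made for illustration; the abstract only states that a smaller distance means a higher similarity.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dtw_distance(seq_a, seq_b):
    """Classic DTW alignment cost between two sequences of vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # seq_b[j-1] is matched to several points of seq_a
                cost[i][j - 1],      # seq_a[i-1] is matched to several points of seq_b
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]


def similarity(seq_a, seq_b):
    """Hypothetical mapping from distance to a score in (0, 1]."""
    return 1.0 / (1.0 + dtw_distance(seq_a, seq_b))


# Toy "word vector" sequences (assumed 2-D for readability).
text_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
text_b = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # text_a with one point repeated
text_c = [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]              # same shape, shifted away

print(similarity(text_a, text_b))  # warping absorbs the repeated point
print(similarity(text_a, text_c))  # larger distance, lower similarity
```

Because warping lets one point align with several, the repeated point in `text_b` adds no cost relative to `text_a`, whereas the shifted sequence `text_c` accumulates distance at every step and scores lower.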
Keywords/Search Tags:Word2vec, DTW, BERT, Time Warping Distance, Semantic Similarity