
Research On Semantic Similarity Calculation Of Chinese Short Text Based On Deep Learning

Posted on: 2019-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Bai
Full Text: PDF
GTID: 2428330566491393
Subject: Communication and Information System
Abstract/Summary:
With the popularity of the Internet, semantic similarity calculation for Chinese short text has attracted increasing attention in the field of natural language processing. Compared with English, Chinese is an ideographic language without strict grammar, and short texts are short in length, diverse in expression, and irregular in grammatical structure. Traditional processing methods suffer from sparse textual features and loss of semantic information. Existing deep learning methods solve some of the problems of traditional methods but ignore the characteristics of Chinese short texts. Based on deep learning and targeting the characteristics of Chinese short text, this thesis makes the following contributions:

(1) A management system for a Chinese short text semantic similarity data set is constructed. The quality of a deep learning model depends largely on the quality of its training data. Several English text similarity training sets exist at home and abroad, but there is no comparable training set for Chinese short texts. This thesis therefore builds a system in which ordinary users share a main sentence and replies to it, and these are compiled into training and test sets for Chinese short text semantic similarity. A main sentence, a reply, and a similarity value constitute one data item; a total of 12,769 pairs were collected.

(2) A Chinese short text semantic similarity computation model based on stop words and TongyiciCilin is constructed. Most existing Chinese semantic similarity models remove stop words, yet stop words play an important role in Chinese word segmentation, part-of-speech analysis, and semantic similarity calculation. Unlike previous methods, we retain stop words in the word vector training corpus to suit the characteristics of Chinese, and add TongyiciCilin (a Chinese synonym thesaurus) to the training data of the similarity model. The influence of the Word2vec and GloVe methods on model training is also compared. The results show that retaining stop words and adding TongyiciCilin increase the accuracy of the model by 2%-3%.

(3) A double-sequence semantic similarity calculation model for Chinese short texts is constructed. Existing models applied to Chinese short texts are single-sequence models and do not account for the semantic ambiguity caused by synonyms and synonymous phrases. To overcome this problem, we propose a double-sequence model for Chinese short texts in which two identical LSTMs process the two text sequences at the same time. The proposed model is compared with a CNN-based semantic text similarity model and the Baidu semantic text similarity model; the results show that it outperforms them by 6% or more in accuracy, recall, and other metrics.
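The abstract does not give the exact architecture or hyperparameters of the double-sequence model; the following is a minimal sketch, assuming a PyTorch implementation, of how a double-sequence ("Siamese") LSTM similarity model can be organized: both short texts are encoded by the same LSTM and compared with cosine similarity. The vocabulary size, dimensions, and similarity head are illustrative assumptions, not the thesis's configuration.

```python
# Minimal sketch of a double-sequence ("Siamese") LSTM similarity model.
# Hyperparameters and the cosine-similarity head are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseLSTM(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=300, hidden_dim=128):
        super().__init__()
        # Embedding layer; in the thesis, Word2vec/GloVe vectors trained with
        # stop words retained would presumably initialize these weights.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        # Encode one word-index sequence; the final hidden state is the sentence vector.
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (1, batch, hidden_dim)
        return hidden[-1]                         # (batch, hidden_dim)

    def forward(self, sent_a, sent_b):
        # The same LSTM (shared weights) processes both sequences.
        vec_a = self.encode(sent_a)
        vec_b = self.encode(sent_b)
        # Cosine similarity as the semantic similarity score.
        return F.cosine_similarity(vec_a, vec_b, dim=-1)

# Usage: batches of word-index sequences for a main sentence and a reply.
model = SiameseLSTM()
sent_a = torch.randint(1, 20000, (4, 12))        # 4 pairs, 12 tokens each
sent_b = torch.randint(1, 20000, (4, 12))
scores = model(sent_a, sent_b)                   # one similarity score per pair
```

In practice, such a model would be trained against the annotated similarity values of the collected sentence pairs, with the embedding layer initialized from the word vectors described in contribution (2).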
Keywords/Search Tags:Deep Learning, Chinese Short Text, Semantic Similarity Calculation