
Research On Semantic Similarity Calculation Of Chinese Short Text Based On Deep Learning

Posted on: 2019-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Bai
Full Text: PDF
GTID: 2428330566491393
Subject: Communication and Information System
Abstract/Summary:
With the popularity of the Internet, semantic similarity calculation for Chinese short text has attracted increasing attention in the field of natural language processing. Compared with English, Chinese is an ideographic language without strict grammar, and short texts are short in length, diverse in expression, and irregular in grammatical structure. Traditional processing methods suffer from sparse textual features and loss of semantic information. Existing deep learning methods solve some of the problems of traditional methods but ignore the characteristics of Chinese short texts. Based on deep learning and targeting the characteristics of Chinese short text, this thesis makes the following contributions:

(1) A management system for a Chinese short text semantic similarity data set is constructed. The quality of a deep learning model depends largely on the quality of its training data. Several English text similarity training sets exist at home and abroad, but there is no comparable training set for Chinese short texts. This thesis therefore builds a system in which ordinary users share a main sentence and replies to it, and these are compiled into training and test sets for Chinese short text semantic similarity. A main sentence, a reply, and a similarity value constitute one data item; a total of 12,769 pairs were collected.

(2) A Chinese short text semantic similarity computation model based on stop words and TongyiciCilin is constructed. Most existing Chinese semantic similarity models remove stop words, yet stop words play an important role in Chinese word segmentation, part-of-speech analysis, and semantic similarity calculation. Unlike previous methods, we retain stop words in the word vector training corpus to suit the characteristics of Chinese, and add TongyiciCilin (a Chinese synonym thesaurus) to the training data of the similarity model. The influence of the Word2vec and GloVe methods on model training is also compared. The results show that retaining stop words and adding TongyiciCilin increase the accuracy of the model by 2%-3%.

(3) A double-sequence semantic similarity calculation model for Chinese short texts is constructed. Existing models applied to Chinese short texts are single-sequence models and do not account for the semantic ambiguity caused by synonyms and synonymous phrases. To overcome this problem, we propose a double-sequence model for Chinese short texts in which two identical LSTMs process the two text sequences at the same time. The proposed model is compared with a CNN-based semantic text similarity model and the Baidu semantic text similarity model; the results show that it outperforms them by 6% or more in accuracy, recall, and other metrics.
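The abstract does not give the exact architecture or hyperparameters of the double-sequence model; the following is a minimal sketch, assuming a PyTorch implementation, of how a double-sequence ("Siamese") LSTM similarity model can be organized: both short texts are encoded by the same LSTM and compared with cosine similarity. The vocabulary size, dimensions, and similarity head are illustrative assumptions, not the thesis's configuration.

```python
# Minimal sketch of a double-sequence ("Siamese") LSTM similarity model.
# Hyperparameters and the cosine-similarity head are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseLSTM(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=300, hidden_dim=128):
        super().__init__()
        # Embedding layer; in the thesis, Word2vec/GloVe vectors trained with
        # stop words retained would presumably initialize these weights.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        # Encode one word-index sequence; the final hidden state is the sentence vector.
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (1, batch, hidden_dim)
        return hidden[-1]                         # (batch, hidden_dim)

    def forward(self, sent_a, sent_b):
        # The same LSTM (shared weights) processes both sequences.
        vec_a = self.encode(sent_a)
        vec_b = self.encode(sent_b)
        # Cosine similarity as the semantic similarity score.
        return F.cosine_similarity(vec_a, vec_b, dim=-1)

# Usage: batches of word-index sequences for a main sentence and a reply.
model = SiameseLSTM()
sent_a = torch.randint(1, 20000, (4, 12))        # 4 pairs, 12 tokens each
sent_b = torch.randint(1, 20000, (4, 12))
scores = model(sent_a, sent_b)                   # one similarity score per pair
```

In practice, such a model would be trained against the annotated similarity values of the collected sentence pairs, with the embedding layer initialized from the word vectors described in contribution (2).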
Keywords/Search Tags:Deep Learning, Chinese Short Text, Semantic Similarity Calculation