Research On Chinese Sentence Similarity Calculation Based On Deep Learning

Posted on:2020-05-09

Degree:Master

Type:Thesis

Country:China

Candidate:H Li

Full Text:PDF

GTID:2428330596475098

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of Internet technology and the continuous improvement of information construction in China,the number of Chinese Internet users is increasing and a large amount of Chinese short text data appears on the Internet.As a basic task in natural language processing,sentence similarity calculation plays an important role in information retrieval,text classification,machine translation,intelligent customer service question and answer system,etc.Therefore,it has a very broad prospect and research value.In this paper,we study related technology and network model in Chinese sentence similarity calculation and deep learning,and complete the following work:Firs of all,this paper constructs an abundant Chinese sentence datasets,and performs lots of preprocessing work on these Chinese sentence datasets,including retaining some stop words,performing word segmentation,part-of-speech tagging,named entity recognition,dependency parsing,etc.Secondly,based on the classical neural network model,we improve and propose sentence similarity model for Chinese sentence similarity calculation,which combines the CNN and tensor layer and adopts the dynamic k-max pooling technology.Therefore,the model can extract the feature better and the interaction information between two sentences more effectively which improves the performance of the model.Thirdly,deep neural network is an effective method for sentence similarity calculation task but often requires substantial data to train to fully exploit the performance of the model,while open source Chinese datasets is less and the similarity of two sentences is expensive to label.In order to solve this problem,we improve the sentence similarity model,we design and implement Deep Assistance Neural Network(DANN)model.In this model,we utilize a large amount of unlabeled data to assist in training model parameters.The AdaDelta algorithm is used to optimize the stochastic gradient descent(SGD)method during training,which improves the quality of model training.Finally,we set up several comparative experiments to verify the performance of the models and the feasibility of the strategy in this paper.The experimental results show that compared with the best performing model MV-LSTM in the current baseline models,the sentence similarity model proposed in this paper has better performance in the Chinese sentence similarity calculation work,and improves the F1 value by 0.024.Moreover,quality of DANN model training is improved by the optimization of the AdaDelta algorithm,and the method of using a large amount of unlabeled data to assist the training of model parameters also effectively improves the performance of the model on small-scale annotated datasets.Compared with the sentence similarity model,the F1 value is improved by 0.023,and the F1 value of the model will be further improved as the amount of unlabeled data increases.

Keywords/Search Tags:

Deep Learning, Neural Network, Natural Language Processing, Sentence Similarity Calculation, Word Embedding

Related items

1	Computational Methods Of Sentence Distance Based On Multi-modal Word Embedding
2	Research On Automatic Answering Technique Of English Test
3	Research On Calculation Method Of Chinese-Thai Cross-language Sentence Similarity Based On Word Embedding
4	Research On Sentence Classification Based On Deep Learning And Feature Embedding
5	Sentence-embedding And Similarity Via Hybrid Bidirectional-LSTM And CNN Utilizing Weighted-pooling Attention
6	Joint Learning Methods For Distributed Representations Of Natural Language
7	Research And Application Of Sentence Similarity Calculation Based On Distributed System
8	Research On Machine Learning For Natural Language Processing And Transmission
9	A Sentence Representation Method Based On Syntax And Semantic
10	Unsupervised Extractive Text Summarization Using Sentence Embedding