Research On Technology Of Cross-language Similarity Evaluation Based On Deep Learning

Posted on: 2019-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zuo
GTID: 2428330548487374
Subject: Engineering
Abstract/Summary:
Traditional cross-language similarity evaluation techniques usually rely on theories of linguistics and pragmatics, and are therefore inevitably tied to the features of natural languages. In recent years, the rise of deep learning has continuously advanced many fields of artificial intelligence research, such as image recognition, speech recognition, and natural language processing. This thesis studies the application of deep learning to cross-language text similarity calculation between Chinese and English, at two levels: the word level and the sentence level.

The word-level study treats words as the text unit. It learns bilingual word representations and constructs a bilingual word embedding model, which produces word embeddings shared between the two languages. The semantic similarity between words can then be measured by calculating the spatial distance between their vectors. Based on the theory of word embedding and the Skip-Gram model, the thesis first trains word embeddings on an artificially constructed pseudo-bilingual corpus. Second, to make the word embedding space as complete as possible, it uses monolingual corpora as a supplement to learn additional word embedding knowledge. On top of the embedding model, the thesis also constructs three algorithms by combining part-of-speech information, topic information, and TF-IDF information, respectively, with the bilingual word representation. All three algorithms can be used for cross-language text similarity calculation, and the combinations overcome shortcomings of the original method in representing text semantics.

The sentence-level study treats the sentence as the text unit. By combining the semantic information of words with the context information of each word, the whole sentence is represented as a vector, which is used to compute the similarity between texts in different languages. To this end, the thesis proposes SCLSE, a sentence-level cross-language similarity evaluation framework. The framework uses word embeddings as its underlying vector representation, learns semantic representations of sentences by integrating a variety of neural network structures, and finally outputs a similarity score for the sentence pair. By segmenting short texts into paragraphs and treating paragraphs as long sequences, the thesis also performs iterative similarity calculation on a larger scale.

For both research points, contrast experiments are set up to verify the validity and application value of the bilingual word embedding model and the SCLSE framework in cross-language text similarity evaluation tasks at different text unit granularities.
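The word-level idea described in the abstract, combining TF-IDF information with shared bilingual word vectors and comparing texts by the distance between vectors, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual algorithm: the toy three-dimensional vectors in EMBEDDINGS are hypothetical stand-ins for embeddings that a Skip-Gram model would learn, cosine similarity is assumed as the distance measure, the smoothed IDF formula is one common variant, and the function names are invented for this example.

```python
import math
from collections import Counter

# Hypothetical toy vectors in a shared Chinese-English embedding space.
# A real model would learn these, e.g. with Skip-Gram on a pseudo-bilingual corpus.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "猫": [0.8, 0.2, 0.0],
    "dog": [0.7, 0.3, 0.1],
    "狗": [0.6, 0.4, 0.1],
    "car": [0.0, 0.1, 0.9],
    "汽车": [0.1, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def idf_table(documents):
    """Smoothed inverse document frequency for every token in a tokenized corpus."""
    n = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def text_vector(tokens, idf):
    """TF-IDF-weighted sum of the word vectors of the known tokens in a text."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    term_freq = Counter(tok for tok in tokens if tok in EMBEDDINGS)
    for tok, count in term_freq.items():
        weight = count * idf.get(tok, 1.0)  # unseen tokens get a neutral weight
        for i, x in enumerate(EMBEDDINGS[tok]):
            total[i] += weight * x
    return total


def text_similarity(tokens_a, tokens_b, idf):
    """Cross-language similarity of two tokenized texts in the shared space."""
    return cosine(text_vector(tokens_a, idf), text_vector(tokens_b, idf))


if __name__ == "__main__":
    corpus = [["cat", "dog"], ["猫", "狗"], ["汽车"]]
    idf = idf_table(corpus)
    print(text_similarity(["cat", "dog"], ["猫", "狗"], idf))
    print(text_similarity(["cat", "dog"], ["汽车"], idf))
```

Because both languages share one vector space, the English text about cats and dogs scores much closer to its Chinese counterpart than to the unrelated Chinese text about cars. Cosine similarity ignores vector length, so the weighted sum does not need to be normalized into an average first.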
Keywords/Search Tags: deep learning, cross-language similarity, text unit, bilingual word embedding, semantic representation