
Research On Calculation Method Of Chinese-Thai Cross-language Sentence Similarity Based On Word Embedding

Posted on: 2020-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Feng
Full Text: PDF
GTID: 2438330596497532
Subject: Control Engineering
Abstract/Summary:
Cross-language sentence similarity calculation plays an important role in text mining, web search, machine translation and question answering systems, and has long been an important research topic in natural language processing. With the continuing advancement of China's Belt and Road Initiative, cooperation between China and Thailand is increasing. The growing economic and cultural exchange between the two countries makes language communication more important and urgent, yet language differences remain an obstacle. Thai is a low-resource language: Thai language resources are hard to obtain, and there are few related studies on Thai language processing. The study of Chinese-Thai cross-language sentence similarity therefore faces great challenges. This thesis aims to solve the problem of calculating the similarity of Chinese-Thai cross-language sentences, and addresses it from the following three aspects.

(1) For sentence similarity within Thai, a method for calculating Thai sentence similarity based on part of speech and word vectors is proposed. The method first uses part-of-speech (POS) tagging results to calculate the similarity of two Thai sentences with respect to the parts of speech they contain. It then converts the words in the sentences into vectors with a word vector training tool and calculates the similarity of the overlapping words in the two sentences. Finally, the part-of-speech similarity and the word vector similarity are combined to give the similarity of the Thai sentences. The method thus considers part of speech and also incorporates semantics.

(2) A method for calculating the similarity of Chinese-Thai cross-language words based on non-parallel corpora is proposed. First, the Chinese and Thai monolingual word vectors are normalized, and initial values are obtained for the optimal orthogonal linear transformation between the Chinese and Thai word vectors of the bilingual word pairs. Second, the words of a large Chinese corpus are clustered, and the bilingual word pairs are used to find the Chinese words that fall in each cluster. The mean of the word vectors in each cluster and the mean of the corresponding Chinese and Thai word vectors are taken as a new set of Chinese-Thai bilingual vector pairs and added to the original bilingual word pairs. This establishes a new bilingual word vector correspondence, so that the original bilingual word pairs are generalized and expanded. Then, using the expanded bilingual word pairs, the Chinese-Thai cross-language word embedding mapping model is trained to obtain the optimal mapping matrix W. Finally, the mapping model maps Chinese word vectors into the Thai word vector space, so that the similarity of Chinese-Thai cross-language words can be calculated in the Thai vector space.

(3) A method for calculating the similarity of Chinese-Thai cross-language sentences based on sentence embedding is proposed. The method first preprocesses the parallel sentence pairs to obtain the word-segmented Chinese and Thai sentences. It then uses Word2Vec to process the two word sets to obtain the word vector sets, and uses the integrative weight sentence embedding method to obtain the Chinese and Thai sentence vector matrices. Finally, the cross-language word vector mapping method proposed in this thesis is extended to suit the cross-language sentence embedding model and to achieve sentence-level cross-language vector space mapping, so that the similarity of cross-language sentences can be calculated.
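
The word-level mapping step in aspect (2) can be illustrated with a minimal sketch. It assumes the orthogonal mapping matrix W is fitted with the standard orthogonal Procrustes solution (an SVD of the cross-covariance of the seed pairs); the abstract does not state which solver the thesis uses. The function names, the toy random vectors, and the 4-dimensional space are hypothetical and chosen only for illustration.

```python
import numpy as np

def normalize_rows(m):
    """Scale every row vector to unit length (length normalization)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.clip(norms, 1e-12, None)

def fit_orthogonal_map(zh, th):
    """Return the orthogonal W minimizing ||zh @ W - th||_F.

    Assumption: the standard Procrustes solution W = U @ Vt,
    where U, S, Vt is the SVD of zh.T @ th.
    """
    u, _, vt = np.linalg.svd(zh.T @ th)
    return u @ vt

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data standing in for seed bilingual word pairs (hypothetical):
# the "Thai" vectors are an exact rotation of the "Chinese" vectors.
rng = np.random.default_rng(0)
dim = 4
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
zh_seed = normalize_rows(rng.normal(size=(20, dim)))
th_seed = zh_seed @ true_rotation

W = fit_orthogonal_map(zh_seed, th_seed)

# Map one Chinese vector into the Thai space and compare it there.
mapped = zh_seed[0] @ W
sim = cosine(mapped, th_seed[0])
```

On this noise-free toy data the fitted W recovers the rotation, so a mapped Chinese vector coincides with its Thai counterpart. With real embeddings the fit is only approximate, and the cosine similarity in the Thai space serves as the cross-language word similarity score.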
Keywords/Search Tags: cross-language word embedding, sentence embedding, sentence similarity, Chinese-Thai cross-language sentence similarity