
Research On Similarity Comparison Of Cross Language Texts Based On Multi-language Embedding

Posted on: 2022-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: K Wang
GTID: 2518306350981809
Subject: Master of Engineering
Abstract/Summary:
Research on cross-language text similarity usually starts from cross-language word vectors and then applies word-meaning features to semantic extraction. At present, cross-language word embedding spaces are mainly constructed by learning mapping relations between the embedding spaces of different languages. However, when the embedding structures of the language spaces are not very similar, mapping-based methods perform poorly. In cross-language semantics research, the differences between languages and the influence of polysemy also make it difficult to extract and integrate the features of cross-language sentences, and extracting semantics from a single word-level feature alone is not effective.

To address the problem of cross-language word representation, this thesis proposes a Shared Word Embedding Space Based on Pseudo Corpus (SEB-PC). The method uses the word alignment tool GIZA++ to obtain word mapping relations from a parallel corpus, and based on these relations it proposes an algorithm for constructing a bilingual pseudo corpus. Combined with skip-gram training, this shortens the distance between mapped word pairs in the bilingual word embedding space. In addition, the thesis proposes an algorithm that builds a pseudo trilingual corpus from the bilingual pseudo corpus and uses it to construct a trilingual shared word embedding space. Compared with the bilingual space, the trilingual space captures more of the relative positions of word embeddings across languages. Finally, SEB-PC is evaluated in word similarity and word translation experiments on multiple language pairs. Compared with mapping-based embedding methods, SEB-PC achieves more stable results on distant language pairs.

To address the problem of cross-language semantic feature extraction, this thesis proposes a Cross Language Feature Capture Model on Similarity Matrix (FCM-SM). Unlike single-feature extraction methods, the model also incorporates phrase-level features. In experiments on cross-language repetition recognition and cross-language sentence alignment, FCM-SM outperforms both the single-feature extraction method and other cross-language models.

Together, SEB-PC and FCM-SM address word representation and semantic feature extraction in cross-language similarity research. Their effectiveness is demonstrated by experiments on different language pairs, in which the two methods are compared on both homologous and non-homologous languages.
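The abstract does not spell out the pseudo corpus construction algorithm, so the following is only a minimal sketch of one plausible reading, not the thesis's actual method. It assumes that word alignments are already available as `(source_index, target_index)` pairs (for example, converted from a word aligner's output) and that a pseudo sentence is formed by swapping aligned words between the two sides with some probability. The function name `build_pseudo_sentences` and the parameters `swap_prob` and `seed` are hypothetical.

```python
import random


def build_pseudo_sentences(src_tokens, tgt_tokens, alignment, swap_prob=0.5, seed=0):
    """Build two mixed-language sentences from one aligned sentence pair.

    alignment is a list of (src_index, tgt_index) pairs. This format and the
    random-swap strategy are assumptions made for illustration only.
    """
    rng = random.Random(seed)

    # Keep only the first alignment link per position (a simplifying assumption).
    src_to_tgt = {}
    tgt_to_src = {}
    for i, j in alignment:
        src_to_tgt.setdefault(i, j)
        tgt_to_src.setdefault(j, i)

    # Replace an aligned source word with its target counterpart with
    # probability swap_prob, so both words end up in shared contexts.
    mixed_src = [
        tgt_tokens[src_to_tgt[i]] if i in src_to_tgt and rng.random() < swap_prob else tok
        for i, tok in enumerate(src_tokens)
    ]
    # Do the same in the other direction for the target sentence.
    mixed_tgt = [
        src_tokens[tgt_to_src[j]] if j in tgt_to_src and rng.random() < swap_prob else tok
        for j, tok in enumerate(tgt_tokens)
    ]
    return mixed_src, mixed_tgt


if __name__ == "__main__":
    src = ["the", "cat", "sleeps"]
    tgt = ["le", "chat", "dort"]
    links = [(0, 0), (1, 1), (2, 2)]
    print(build_pseudo_sentences(src, tgt, links, swap_prob=0.5, seed=1))
```

Under these assumptions, the mixed sentences would be added to the training data of a skip-gram model. Because aligned words then appear in the same contexts, their vectors are pulled closer together in the shared space, which matches the effect the abstract describes.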
Keywords/Search Tags:Cross language, Pseudo corpus, Shared word embedding space, Similarity matrix, Feature extraction