Research on Key Techniques of Cross-Language Text Similarity Detection Based on Word Vector

Posted on: 2020-12-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Guo
GTID: 1368330596493906
Subject: Computer Science and Technology
Abstract/Summary:
Cross-language text similarity detection plays an important role in many cross-language processing applications and related fields. Its goal is to determine the degree of semantic similarity between two given text segments written in different languages. This judgment is trivial for bilingual humans, but it is a difficult and profound problem when constructing algorithms and computational models that simulate human-level cognition in natural language processing. Cross-language text similarity detection originates from single-language text similarity detection and is more difficult, because different languages belong to different artificial symbolic systems, which leads to differences in morphology, grammar, syntax and expression structure.

The usual practice is to use machine translation or cross-language text mapping. The disadvantage of machine translation is that it has not reached human level, so the translation process loses some semantic information and produces inaccurate translations. Direct cross-language text mapping leads to an excessively large semantic granularity between the two languages, and the hierarchical semantic features of the languages cannot be reflected accurately. In addition, other linguistic phenomena, such as polysemy and out-of-vocabulary (OOV) words, directly affect the accuracy of cross-language text similarity detection.

To solve these problems, this thesis proposes a stacked generalization ensemble learning method based on word vectors, in which traditional linguistic features are extended with deep semantic features. The purpose is to reduce the information lost during cross-language semantic extraction. Similarity detection is accomplished by constructing lexical and sentence-level features without the aid of a machine translation system. The key technologies of cross-language similarity detection include the construction of multi-sense word vectors, the construction of cross-language word vectors, cross-language feature engineering, and similarity measurement. The main innovative research results of this thesis are as follows:

1. A multi-sense word vector model, MSCVec (Multi-sense Soft Cluster Vector), based on non-negative matrix factorization and sparse soft clustering is proposed. MSCVec is a single-language word vector model. It applies non-negative matrix factorization to the positive pointwise mutual information (PPMI) between words and contexts to extract a low-rank representation of the mixed semantics of words. It then separates the multiple meanings of polysemous words with a sparse soft clustering algorithm, which also yields the global membership of each polysemous word in each sense. The specific sense cluster of a polysemous word is determined from the negative average log-likelihood computed from the context semantics and the word's global membership. Finally, the multi-sense vectors are trained with the fastText model over the vocabulary of the extended dictionary. The advantages of MSCVec are that it is an unsupervised learning process that needs no knowledge base, that the substring embeddings in the model ensure OOV words can still be represented, and that the multi-sense embeddings of a polysemous word can be weighted by the global membership into a single-sense embedding. Compared with traditional static word vectors, MSCVec shows excellent results in word similarity experiments and in a downstream text classification task.

2. A cross-language word vector model, SCLVec (Siamese Cross-Lingual Vector), based on joint training of a sparse attention alignment model and a Siamese network is proposed. SCLVec is a cross-language word vector model with a shared embedding space, learned from a parallel corpus. It requires neither cross-language dictionary information nor expensive word alignment; it uses only a sparse attention mechanism to achieve alignment (mapping) at lexical granularity. To maximize semantic similarity at both the lexical level and the sentence level, the model uses joint training with a Siamese recurrent neural network: the word vector layer of one input is frozen, and the Siamese network jointly updates the embedding layer of the other input to obtain the cross-language word vectors. SCLVec outperforms other models in experiments on bilingual synonyms and on zero-shot transfer text classification in English and Chinese.

3. A cross-language sentence-level semantic similarity detection method based on feature expansion is proposed. To solve the problem of incomplete semantic granularity in cross-language sentence feature representations, the MSCVec multi-sense word vectors and the SCLVec cross-language word vectors are used as the embedding layer of a (pseudo-)Siamese network to train deep sentence-level semantic features across languages. External resources are then used to obtain traditional statistical similarity features across languages. The two sets of features are fused into a new, extended semantic feature set, and similarity classification and stacked generalization experiments are designed for comparison. The experimental results show that, in the cross-language sentence-level similarity detection task, (1) as the input embedding layer, SCLVec cross-language word vector features are better than MSCVec multi-sense word vector features; (2) deep sentence-level semantic features trained with a Siamese recurrent neural network are better than those trained with a Siamese convolutional neural network; (3) the expansion with traditional statistical features, especially the cross-language topic model (BL-LDA), effectively improves the performance of cross-language similarity detection; (4) stacked generalization ensemble learning maximally reduces the error rate of the base classifiers and improves detection accuracy.
Keywords/Search Tags: multi-sense word vector, cross-language word vector, siamese network, feature expansion, text similarity detection
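
Illustrative sketch (not taken from the thesis): the first step of the MSCVec model described in the abstract, non-negative matrix factorization of the word/context PPMI matrix, can be outlined in a few lines of plain Python. Everything specific here is an assumption made for illustration: the toy corpus, the window size, the rank of 2, the Lee-Seung multiplicative update rule, and the function names. The row normalization at the end is only a simple stand-in for a soft membership. The thesis's sparse soft clustering, likelihood-based sense assignment and fastText training are not reproduced.

```python
import math
import random
from collections import Counter

def cooccurrence(sentences, window=2):
    """Count symmetric word/context co-occurrences within a fixed window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

def ppmi_matrix(counts):
    """Return the vocabulary and a dense PPMI matrix (list of rows)."""
    vocab = sorted({w for w, _ in counts})
    index = {w: k for k, w in enumerate(vocab)}
    total = sum(counts.values())
    row, col = Counter(), Counter()
    for (w, c), n in counts.items():
        row[w] += n
        col[c] += n
    matrix = [[0.0] * len(vocab) for _ in vocab]
    for (w, c), n in counts.items():
        pmi = math.log(n * total / (row[w] * col[c]))
        matrix[index[w]][index[c]] = max(0.0, pmi)  # keep positive PMI only
    return vocab, matrix

def transpose(a):
    return [list(r) for r in zip(*a)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(r, c)) for c in cols] for r in a]

def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor v ~= w @ h with non-negative w, h (multiplicative updates)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

def frobenius_error(v, w, h):
    approx = matmul(w, h)
    return math.sqrt(sum((x - y) ** 2
                         for rv, ra in zip(v, approx) for x, y in zip(rv, ra)))

def soft_membership(w, eps=1e-12):
    """Normalize each row of w to sum to 1 (illustrative stand-in only)."""
    return [[x / (sum(r) + eps) for x in r] for r in w]

# Hypothetical toy corpus in which "bank" is used in two senses.
sentences = [s.split() for s in (
    "the bank approved the loan",
    "the bank raised interest rates",
    "we walked along the river bank",
    "the river bank was muddy",
)]
vocab, ppmi = ppmi_matrix(cooccurrence(sentences))
w, h = nmf(ppmi, rank=2)
membership = soft_membership(w)
```

Each row of `w` is a low-rank, non-negative representation of one word in `vocab`, and `membership[vocab.index("bank")]` gives that word's normalized weights over the two latent components. On a corpus this small the components are not meaningful senses; the sketch only shows the shape of the computation.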