
Research On The Extraction Method Of Chinese-Vietnamese Pseudo-parallel Sentence Pairs Based On Image-text Information Enhancement

Posted on: 2022-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Tian
Full Text: PDF
GTID: 2518306524952539
Subject: Software engineering
Abstract/Summary:
Parallel sentence pair extraction is a key task for alleviating data scarcity in low-resource machine translation, and it is also an important means of improving machine translation performance. However, current methods for extracting parallel sentence pairs are all based on measuring sentence-level semantic similarity. They do not consider that different words in a sentence differ in how difficult they are to represent semantically, and they focus mainly on the sentence level, ignoring document-level context and the information contained in images. As a result, the extracted sentence semantics are insufficient and the extracted parallel sentence pairs are of low quality, which leads to poor performance in Chinese-Vietnamese neural machine translation (NMT). To address these problems, this thesis studies how to make effective use of comparable corpus resources, extract high-quality Chinese-Vietnamese pseudo-parallel sentence pairs from large comparable corpora, and improve Chinese-Vietnamese NMT performance under low-resource conditions.

First, the thesis introduces the research status, existing problems, and challenges of parallel sentence pair extraction. It then analyzes the characteristics of Chinese-Vietnamese comparable corpora and explores a method for acquiring them from Internet resources. Second, at the word level, it studies a method based on semantic adaptive coding for extracting Chinese-Vietnamese pseudo-parallel sentence pairs, in order to strengthen the semantic representation of sentences. Third, it studies an extraction method based on document-level context information, which gives the model context awareness and enriches the contextual information of each sentence. On this basis, it also studies an extraction method that fuses image information, so that the model attends to text and images together and enriches the semantic information of sentences. Finally, a prototype machine translation system for Chinese-Vietnamese minority languages is developed, and future research directions and development trends are discussed.

The main contributions of this thesis are as follows:

(1) A semantic representation network framework based on bidirectional LSTM and semantic adaptive coding is proposed. Because the difficulty of representing each word in a sentence is uncertain, the model is guided to apply deeper computation where it is needed. Specifically, Chinese and Vietnamese sentences are first encoded. Words are then represented adaptively according to how difficult they are to represent semantically, so that the semantic information of different words is mined in depth and a deep representation of the Chinese and Vietnamese sentences is obtained. At the decoding end, these deep representation vectors are mapped into a unified common semantic space to maximize the semantic similarity between the represented sentences, thereby extracting higher-quality Chinese-Vietnamese pseudo-parallel sentences. Experimental results show that the model improves the F1 score by 5.09% over the baseline model. In addition, using the extracted sentence pairs to train a machine translation model yields a significant improvement in translation performance.

(2) A method that fuses document-level context information is proposed to extract higher-quality Chinese-Vietnamese pseudo-parallel sentences. Specifically, the context of each sentence is modeled with knowledge of the document-level global context of both the source and target languages. Four new context encoders extend the Transformer model to represent the document-level context, which is then incorporated into the original encoder to maximize the semantic similarity between sentences. Experimental results show that this method improves the F1 score by 7.15% over the baseline model on the Chinese-Vietnamese document-level data set. In addition, when the extracted sentence pairs are used to train a machine translation model, the BLEU score improves by up to 0.63, which significantly improves machine translation performance.

(3) Existing work on pseudo-parallel sentence pair extraction is text-based: it considers only the source-language and target-language sentences and ignores the information contained in images, so the sentence semantics captured by the model are insufficient. This thesis proposes a Chinese-Vietnamese pseudo-parallel sentence pair extraction method based on image information fusion. Specifically, the Transformer model encodes the source sentence, a new image feature extractor extends the sentence encoder, and an attention mechanism fuses the extracted image features into the sentence representation. The image feature extractor mainly obtains the semantic information of the image as knowledge to enrich the semantic information of the sentence. Experimental results show that the proposed method significantly improves on the baseline model on aligned Chinese-Vietnamese text and image data sets.
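
The extraction step that the three methods share (score candidate sentence pairs by semantic similarity in a common space, keep the high-scoring ones, and evaluate with F1) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the toy vectors stand in for encoder outputs, and the function names, the best-match selection rule, and the 0.8 threshold are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def extract_pairs(zh_vecs, vi_vecs, threshold=0.8):
    """For each Chinese sentence vector, keep its best-scoring Vietnamese
    candidate if the similarity reaches the (assumed) threshold."""
    pairs = []
    for i, u in enumerate(zh_vecs):
        scores = [cosine(u, v) for v in vi_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j))
    return pairs

def f1_score(predicted, gold):
    """F1 of predicted (zh_index, vi_index) pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentence embeddings already mapped into a shared space.
zh_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vi_vecs = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
gold = [(0, 0), (1, 1), (2, 2)]  # hypothetical gold alignments

pairs = extract_pairs(zh_vecs, vi_vecs)
score = f1_score(pairs, gold)
print(pairs, round(score, 2))
```

In this toy setting the third Chinese vector has no Vietnamese candidate above the threshold, so it is left unpaired; raising or lowering the threshold trades precision against recall, which is why F1 is the natural metric for comparing extraction models.
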
Keywords/Search Tags: Semantic adaptive coding, Context Awareness, Transformer Model, Document-Level Context, Attention Mechanism