Research On The Translation Of Out Of Vocabulary Words In The Neural Machine Translation For Chinese And English Patent Corpus

Posted on:2018-08-17

Degree:Master

Type:Thesis

Country:China

Candidate:X K Zheng

Full Text:PDF

GTID:2348330512493297

Subject:Computer technology

Abstract/Summary:

The purpose of machine translation (MT) is to find the target-language sentence whose meaning is closest to that of the source-language sentence. In essence, MT is a sequence-to-sequence task. In recent years, following the progress of deep neural networks (DNNs) in speech recognition, image processing, and other areas, researchers have begun to apply DNNs to symbolic data, for example to machine translation in natural language processing.

A neural machine translation (NMT) system contains two neural networks, an encoder and a decoder. The encoder transforms the source-language sentence into a vector representation. The decoder generates the target-language word sequence from that vector representation and from the target-language words generated so far. To control computational complexity, most NMT systems limit the size of the source-language and target-language dictionaries. The dictionary size is usually between 30K and 80K words; words that are not in the dictionary, called out-of-vocabulary (OOV) words, are replaced with the symbol <UNK>.

OOV words cause several problems. First, at test time the model cannot generate an appropriate translation for them. Second, the semantics of the source-language sentence cannot be expressed correctly, which increases the ambiguity of the translation. Third, the sentence structure of both the source and target sides of the training corpus is seriously damaged, which lowers the quality of the learned network parameters. Patent literature contains many low-frequency words, which makes these problems more serious.

In this thesis, we use a Chinese-English patent corpus and propose a method that improves how NMT handles OOV words, in order to improve the translation of patent documents. The main research achievements are as follows:

(1) We introduce the alignment information of statistical machine translation, add a corpus dictionary as external information, and translate OOV words according to the attention mechanism of the NMT model.

(2) We label the technical terms in the corpus and translate them through pre-processing and post-processing.

(3) We add a multi-model fusion mechanism: multiple translation models are trained, and at decoding time the results produced by the models are scored and the best one is selected.

Experimental results on the Chinese-English patent corpus show that the proposed method handles OOV words and patent terms effectively and improves the translation results.
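The <UNK> substitution and the attention-plus-dictionary idea in point (1) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names (`build_vocab`, `mask_oov`, `replace_unk`), the toy sentence, the attention weights, and the one-entry dictionary are all hypothetical, and the attention matrix is written by hand rather than produced by a trained model.

```python
from collections import Counter

UNK = "<UNK>"


def build_vocab(sentences, max_size):
    """Keep only the max_size most frequent tokens; all others become OOV."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}


def mask_oov(tokens, vocab):
    """Replace every token outside the vocabulary with the <UNK> symbol."""
    return [tok if tok in vocab else UNK for tok in tokens]


def replace_unk(source_tokens, output_tokens, attention, dictionary):
    """Post-process a translation that contains <UNK> symbols.

    attention[t][s] is the (assumed) attention weight that output position t
    places on source position s. For each <UNK> in the output, take the
    source word with the highest weight and look it up in the bilingual
    dictionary; if it is missing, copy the source word unchanged.
    """
    result = []
    for t, tok in enumerate(output_tokens):
        if tok != UNK:
            result.append(tok)
            continue
        weights = attention[t]
        src_pos = max(range(len(weights)), key=weights.__getitem__)
        src_word = source_tokens[src_pos]
        result.append(dictionary.get(src_word, src_word))
    return result


# Toy example with made-up values (assumptions, not data from the thesis).
source = ["该", "催化剂", "包含", "沸石"]
output = ["the", "catalyst", "contains", UNK]
attention = [
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.10, 0.80, 0.05],
    [0.05, 0.05, 0.10, 0.80],
]
dictionary = {"沸石": "zeolite"}

print(replace_unk(source, output, attention, dictionary))
# ['the', 'catalyst', 'contains', 'zeolite']
```

Picking the single highest-weight source position is the simplest possible choice; a real system would also have to cope with diffuse attention and with multi-word terms.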

Keywords/Search Tags:

Neural Machine Translation, Out of Vocabulary Words, Patent Literature

Related items

1	Robustness On Neural Machine Translation
2	Mongolian-Chinese Neural Machine Translation Based On The Features Of Statistical Machine Translation
3	A Study On Unknown Words Processing In Mongolian-Chinese Neural Machine Translation
4	Research On Chinese-Myanmar Neural Machine Translation Method Integrating Bilingual Dictionary
5	Research On Unknown Word Processing In Neural Machine Translation
6	Research On Term Automatic Translation Technology In English-Chinese Machine Translation System
7	Research On Unknown Words Processing Method In Neural Machine Translation Using Semantic Concept
8	Methods For Handling OOV In Chinese-Uyghur Neural Machine Translation
9	Research On Example-Based Automatic Machine Translation For English-Chinese Patent
10	Research On Chinese-Myanmar Neural Machine Translation Method With Monolingual Corpus