
Research On Key Technologies Of Data Processing For Machine Translation

Posted on: 2021-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Li
Full Text: PDF
GTID: 2428330614955022
Subject: Computer application technology
Abstract/Summary:
In recent years, the rise of deep learning has produced strong results across many application areas of machine learning. Machine translation has been at the forefront of this progress, and models based on neural networks, known as Neural Machine Translation (NMT), are currently the most advanced machine translation models. Because NMT is a supervised learning technique with strong learning capacity, the quality and scale of the bilingual parallel corpus directly affect the final performance of the translation model. A substantial amount of data processing is therefore required before NMT training begins; only corpora prepared in this way can support training and yield good results.

Drawing on a broad survey of the literature, this thesis examines the methods used in the data preprocessing stage of NMT and proposes improvements to three of them: sentence segmentation, byte pair encoding (BPE), and data enhancement (data augmentation). The aim is to give the NMT model better data before training starts and thus better performance. First, the Bi-directional Long Short-Term Memory (Bi-LSTM) network is applied to sentence segmentation for the first time, and a Thai sentence segmentation model based on a GloVe+Bi-LSTM+CRF architecture is proposed, which segments Thai sentences accurately. Second, an effective data enhancement method is proposed that expands the original bilingual parallel data set at both the word level and the sentence level, thereby improving NMT performance. Third, a new architecture for the BPE algorithm is proposed: message queuing is used for the first time to pass information during execution so that it is shared between processes, and multi-process joint learning is used to address the slow vocabulary learning of BPE.

The proposed Thai sentence segmentation model reaches an F1 score of 98.2% on its test set, and its segmentation accuracy is significantly better than previously reported results in the same field, which demonstrates the effectiveness of the method. Applying the proposed data enhancement method to the base data set raises BLEU scores on multiple test sets and outperforms back-translation, a currently very effective data enhancement method. The new BPE framework significantly improves the efficiency of the algorithm and greatly shortens the training cycle of the NMT model. In short, the methods proposed in this thesis can substantially improve the translation accuracy of NMT models and shorten their training cycle, and they offer useful guidance for NMT research and development.
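
For readers unfamiliar with byte pair encoding, the vocabulary-learning step that the abstract describes as slow can be sketched as follows. This is a minimal single-process version of the standard BPE merge loop, not the multi-process, message-queue architecture proposed in the thesis, whose details are not given here. The function names (`get_pair_counts`, `merge_pair`, `learn_bpe`), the `</w>` end-of-word marker, and the tie-breaking rule are illustrative assumptions.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency dict."""
    # Each word starts as a tuple of characters plus an end-of-word marker
    # (the "</w>" marker is an assumption of this sketch).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so that
        # the result is deterministic (an assumption of this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

Every merge recounts all pairs over the whole vocabulary, which is why the loop becomes expensive on large corpora. The pair-counting step is the natural part to split across corpus shards and worker processes, with the per-shard counts combined before the best pair is chosen.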
Keywords/Search Tags: Neural Machine Translation, Sentence Segmentation, Data Enhancement, Byte Pair Encoding, Model Optimization