Bilingual Parallel Corpus Filtering Method Based On Siamese XLM-R Neural Networks And Feature Fusion In Machine Translation

Posted on: 2024-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: J Tu
GTID: 2568307112476414
Subject: Computer Science and Technology
Abstract/Summary:
Machine translation is the process of translating source-language sentences into semantically equivalent target-language sentences by computer, and it is an important research direction in natural language processing. The quality of the bilingual parallel corpus plays a vital role in the performance of machine translation models, and training high-quality models requires large-scale, high-quality bilingual parallel corpora. However, such corpora are often unavailable, and constructing them manually is extremely costly. It is therefore of great value to extract bilingual parallel corpora from bilingual comparable corpora with automatic filtering methods. In recent years, automatic filtering methods have mainly measured the quality of a bilingual corpus by the degree of semantic matching between the bilingual texts.

To address the insufficient extraction of text semantics in previous automatic filtering methods, this thesis proposes a bilingual parallel corpus filtering method for machine translation based on a siamese XLM-R model. A siamese neural network built on the cross-lingual pre-trained language model XLM-R maps each source-language sentence and its candidate target-language sentences into a deep semantic space. An average pooling operation yields sentence representations of the same dimension, and sentence pairs with high semantic similarity are extracted from the bilingual comparable corpus according to the cosine distance between their sentence representations. Obtaining deeper and more representative encodings of the source-language sentences and their candidate target-language sentences improves filtering performance.

In addition, for classification-based filtering, where filtering performance improves indirectly as the accuracy of the classification system improves, we propose a bilingual parallel corpus filtering method based on feature fusion. On top of the XLM-R classification model, the source-language sentence and the candidate target-language sentence are fed into the UNQE model, and sentence-level quality features are obtained through average pooling and max pooling operations. The average-pooled quality features, the max-pooled quality features, and the sentence features extracted by XLM-R are then deeply fused, and the fused features are used for classification.

Experiments on the WMT18 bilingual parallel corpus filtering task show that the proposed methods outperform the baseline method and are comparable to the systems that participated in the evaluation. This is because the XLM-R network can efficiently encode source-language sentences and the target-language sentences that may match them, and the siamese network can better characterize the differences between texts. Moreover, a classification system that integrates additional sentence-pair quality features classifies more accurately and thereby improves the filtering performance for bilingual parallel corpora.
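The two scoring steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the token-level arrays stand in for XLM-R and UNQE outputs, the `threshold` value is a hypothetical cutoff, and plain concatenation is assumed as a simple stand-in for the "deep fusion" step, whose details the abstract does not specify. All function names are illustrative.

```python
import numpy as np


def mean_pool(token_states, attention_mask):
    """Average token vectors over non-padding positions.

    token_states: (batch, tokens, dim); attention_mask: (batch, tokens) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(token_states.dtype)
    summed = (token_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a_norm = np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-9, None)
    b_norm = np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-9, None)
    return ((a / a_norm) * (b / b_norm)).sum(axis=1)


def filter_pairs(src_states, src_mask, tgt_states, tgt_mask, threshold=0.8):
    """Score sentence pairs and flag those at or above the cutoff.

    In a siamese setup both sides would pass through the same encoder;
    here the encoder outputs are supplied directly. The threshold is a
    hypothetical value, not one taken from the thesis.
    """
    src_vec = mean_pool(src_states, src_mask)
    tgt_vec = mean_pool(tgt_states, tgt_mask)
    scores = cosine_similarity(src_vec, tgt_vec)
    return scores, scores >= threshold


def fuse_features(quality_states, quality_mask, sentence_features):
    """Concatenate average-pooled and max-pooled quality features with
    sentence features. Concatenation is an assumed, simplified fusion."""
    keep = quality_mask[:, :, None].astype(bool)
    avg_feat = mean_pool(quality_states, quality_mask)
    # Padding positions are set to -inf so they never win the max.
    max_feat = np.where(keep, quality_states, -np.inf).max(axis=1)
    return np.concatenate([avg_feat, max_feat, sentence_features], axis=1)
```

With real models, the stand-in arrays would be replaced by the final hidden states of the encoder, and the fused vector would feed a classifier layer.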
Keywords/Search Tags: machine translation, automatic filtering of bilingual parallel corpus, siamese neural network, XLM-R model, contrastive loss