With the deepening of globalization and ever-closer international communication, machine translation (MT) has become a prominent research topic, and the demand for high-quality MT systems keeps growing. As the data foundation of any MT system, parallel corpora have attracted increasing attention from researchers. In practice, MT parallel corpora are typically large, heterogeneous in category, riddled with duplicate data, and unannotated, all of which makes them difficult to use. Taking Chinese-English parallel corpora as an example, this thesis studies three tasks in the corpus construction process: deduplication, corpus classification, and corpus quality assessment.

Traditional text-deduplication methods are slow when processing large-scale short texts and poor at recognizing near-duplicates. Several popular deduplication algorithms are therefore selected, and both the corpus layout and the algorithms are optimized for the parallel-corpus setting. A comparison of execution time and of the number of duplicates found shows that Simhash-based deduplication of parallel corpora has a clear advantage.

Building on currently popular pretrained language models, a parallel-corpus classification model is proposed that fuses several pretrained models. The model fully exploits the characteristics of different languages: it extracts the features of each language with a language-specific pretrained model and then fuses the features with a TextCNN. This design overcomes the limitation of traditional classification algorithms, which target only monolingual data and therefore leave much of the corpus unused. The model achieves the best results in the comparative experiments, making full use of the bilingual data while taking the particularities of each language into account.

A final problem is the high time and financial cost of using
translation software to assess the quality of parallel corpora; moreover, the translations themselves contain errors, which are further amplified in the subsequent similarity calculation. In addition, little research addresses cross-lingual similarity calculation without annotated data. A parallel-corpus quality-assessment scheme is therefore proposed. Based on the BERT-Multilingual pretrained model, the scheme uses contrastive learning to cope with the lack of labeled data. Finally, a cross-lingual semantic-similarity calculation based on contrastive learning is proposed and compared with common unsupervised similarity algorithms. The results show that the proposed model clearly improves on every metric.
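The abstract does not give implementation details, but the Simhash-based deduplication it describes can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes whitespace tokenization, 64-bit fingerprints derived from MD5 token hashes, and a small Hamming-distance threshold for catching near-duplicates; all of these choices are assumptions made here.

```python
import hashlib

def simhash(text, bits=64):
    # Weighted bitwise vote over token hashes: a token with a 1 in a bit
    # position pushes that position up, a token with a 0 pushes it down.
    v = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

def dedup(pairs, threshold=3):
    # Keep a (source, target) sentence pair only if its fingerprint is
    # farther than `threshold` bits from every already-kept pair.
    kept, seen = [], []
    for src, tgt in pairs:
        fp = simhash(src + " " + tgt)
        if all(hamming(fp, s) > threshold for s in seen):
            kept.append((src, tgt))
            seen.append(fp)
    return kept
```

Because the fingerprint is a weighted vote rather than an exact hash, pairs that differ in only a few tokens land within a small Hamming distance of each other, which is what lets this approach find near-duplicates, not just exact copies.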
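The contrastive objective used for the unsupervised similarity model can likewise be illustrated with an in-batch InfoNCE loss. The sketch below is an assumption about the general technique, not the thesis's code: it presumes each sentence is encoded twice (for example, under two different dropout masks of BERT-Multilingual), so that matching rows are positives and the other rows in the batch act as negatives; plain Python lists stand in for the encoder outputs.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(emb_a, emb_b, temperature=0.05):
    # emb_a[i] and emb_b[i] are two encodings of the same sentence;
    # every emb_b[j] with j != i serves as an in-batch negative.
    n = len(emb_a)
    loss = 0.0
    for i in range(n):
        sims = [cosine(emb_a[i], emb_b[j]) / temperature for j in range(n)]
        m = max(sims)  # numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += log_denom - sims[i]
    return loss / n
```

Minimizing this loss pulls the two encodings of the same sentence together and pushes encodings of different sentences apart, which is what makes the resulting embedding space usable for similarity scoring without any labeled data.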