With the deepening of globalization and ever-closer international communication, machine translation (MT) has become a prominent research topic, and the demand for high-quality MT systems keeps growing. As the data foundation of any MT system, parallel corpora have attracted increasing attention from researchers. In practice, MT parallel corpora are typically large, heterogeneous in category, riddled with duplicate data, and unannotated, all of which makes them difficult to use. Taking Chinese-English parallel corpora as an example, this thesis studies three tasks in the corpus construction process: deduplication, corpus classification, and corpus quality assessment.

Traditional text-deduplication methods are slow when processing large-scale short texts and poor at recognizing near-duplicates. Several popular deduplication algorithms are therefore selected, and both the corpus layout and the algorithms are optimized for the parallel-corpus setting. A comparison of execution time and of the number of duplicates found shows that Simhash-based deduplication of parallel corpora has a clear advantage.

Building on currently popular pretrained language models, a parallel-corpus classification model is proposed that fuses several pretrained models. The model fully exploits the characteristics of different languages: it extracts the features of each language with a language-specific pretrained model and then fuses the features with a TextCNN. This design overcomes the limitation of traditional classification algorithms, which target only monolingual data and therefore leave much of the corpus unused. The model achieves the best results in the comparative experiments, making full use of the bilingual data while taking the particularities of each language into account.

A final problem is the high time and financial cost of using
translation software to assess the quality of parallel corpora; moreover, the translations themselves contain errors, which are further amplified in the subsequent similarity calculation. In addition, little research addresses cross-lingual similarity calculation without annotated data. A parallel-corpus quality-assessment scheme is therefore proposed. Based on the BERT-Multilingual pretrained model, the scheme uses contrastive learning to cope with the lack of labeled data. Finally, a cross-lingual semantic-similarity calculation based on contrastive learning is proposed and compared with common unsupervised similarity algorithms. The results show that the proposed model clearly improves on every metric.
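The abstract does not give implementation details, but the Simhash-based deduplication it describes can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes whitespace tokenization, 64-bit fingerprints derived from MD5 token hashes, and a small Hamming-distance threshold for catching near-duplicates; all of these choices are assumptions made here.

```python
import hashlib

def simhash(text, bits=64):
    # Weighted bitwise vote over token hashes: a token with a 1 in a bit
    # position pushes that position up, a token with a 0 pushes it down.
    v = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

def dedup(pairs, threshold=3):
    # Keep a (source, target) sentence pair only if its fingerprint is
    # farther than `threshold` bits from every already-kept pair.
    kept, seen = [], []
    for src, tgt in pairs:
        fp = simhash(src + " " + tgt)
        if all(hamming(fp, s) > threshold for s in seen):
            kept.append((src, tgt))
            seen.append(fp)
    return kept
```

Because the fingerprint is a weighted vote rather than an exact hash, pairs that differ in only a few tokens land within a small Hamming distance of each other, which is what lets this approach find near-duplicates, not just exact copies.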
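The contrastive objective used for the unsupervised similarity model can likewise be illustrated with an in-batch InfoNCE loss. The sketch below is an assumption about the general technique, not the thesis's code: it presumes each sentence is encoded twice (for example, under two different dropout masks of BERT-Multilingual), so that matching rows are positives and the other rows in the batch act as negatives; plain Python lists stand in for the encoder outputs.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(emb_a, emb_b, temperature=0.05):
    # emb_a[i] and emb_b[i] are two encodings of the same sentence;
    # every emb_b[j] with j != i serves as an in-batch negative.
    n = len(emb_a)
    loss = 0.0
    for i in range(n):
        sims = [cosine(emb_a[i], emb_b[j]) / temperature for j in range(n)]
        m = max(sims)  # numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += log_denom - sims[i]
    return loss / n
```

Minimizing this loss pulls the two encodings of the same sentence together and pushes encodings of different sentences apart, which is what makes the resulting embedding space usable for similarity scoring without any labeled data.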