
Research On Parallel Corpus Construction Based On Long Text Alignment And Document-Level Alignment

Posted on: 2022-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: C L Cui
GTID: 2518306341482334
Subject: Cyberspace security
Abstract/Summary:
Neural machine translation is an important application of natural language processing, but it requires a large number of parallel sentence pairs as training data to produce good translations. Obtaining these pairs requires constructing a parallel corpus, which involves two processes: document-level alignment, which extracts parallel texts from a corpus, and parallel text alignment, which extracts parallel sentence pairs from those texts. This thesis investigates both techniques.

Long text is an important source of high-quality bilingual data, but aligning it poses many challenges. This study proposes LASER-Align-XL, a long parallel text alignment method based on the LASER model. It uses anchor segmentation to address the accumulation of alignment errors in overly long texts, sentence-level n-gram splicing to handle multi-sentence alignment, machine translation to improve alignment accuracy, and a slot-filling method to improve the alignment output rate. Compared with existing alignment methods, LASER-Align-XL achieves a significant improvement in alignment accuracy while maintaining a high output rate on three language pairs.

For document-level alignment, this thesis proposes two schemes built on different ideas: one based on feature sentences and one based on the number of parallel sentence pairs. The feature-sentence scheme exploits the tendency of long sentences to correspond to long sentences, and short sentences to short sentences, in parallel text: it selects feature sentences, encodes them as sentence vectors, and pools those vectors into a document vector. The scheme based on the number of parallel sentence pairs applies LASER-Align-XL and infers document-level alignment from its parallel text alignment results. Compared with an alignment scheme based on a hierarchical document encoder, both schemes achieve comparable or better results and transfer better across domains. Finally, both schemes are evaluated on the WMT-2016 document alignment task.
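The feature-sentence scheme can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not state how feature sentences are chosen or which pooling is used, so the rule below (keep the k longest sentences, then mean-pool) is an assumption, and all function names are hypothetical. Sentence vectors are passed in as plain lists; in the thesis they would come from the LASER model.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def select_feature_sentences(sentences: List[str], k: int) -> List[int]:
    """Return indices of the k longest sentences, in document order.

    Assumption: length-based selection stands in for the thesis's
    (unspecified) feature-sentence criterion.
    """
    by_length = sorted(range(len(sentences)),
                       key=lambda i: len(sentences[i]), reverse=True)
    return sorted(by_length[:k])


def mean_pool(vectors: List[Vector]) -> List[float]:
    """Average sentence vectors into a single document vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def document_vector(sentences: List[str], embeddings: List[Vector],
                    k: int) -> List[float]:
    """Pool the embeddings of the selected feature sentences."""
    indices = select_feature_sentences(sentences, k)
    return mean_pool([embeddings[i] for i in indices])


def align_documents(src_vecs: List[Vector],
                    tgt_vecs: List[Vector]) -> List[int]:
    """For each source document, pick the most similar target document."""
    return [max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
            for s in src_vecs]
```

The greedy nearest-neighbor matching in `align_documents` is the simplest possible choice and allows several source documents to map to the same target; a one-to-one matching step would be needed for a real alignment task.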
Keywords/Search Tags: parallel text alignment, document-level alignment, parallel corpus construction, LASER model