Parallel Corpus Mining System Based On Cross-lingual Sentence Embedding

Posted on: 2021-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: J J Sang
Full Text: PDF
GTID: 2518306575955599
Subject: Software engineering
Abstract/Summary:
With the continuous development of deep learning in recent years, neural networks have brought major improvements to machine translation. The higher the quality and the broader the coverage of the parallel corpus used for machine translation, the more translation knowledge a neural network can learn and the better its translations become. Although some agencies and organizations are committed to providing high-quality parallel corpora, parallel data for low-resource languages remains scarce, and how to obtain more of it is a persistent research topic. Fortunately, the Internet holds a huge amount of data: many websites are available in multiple languages, and the versions of a web page in different languages are translations of one another. This thesis develops a parallel corpus mining system to explore how to mine parallel corpora from Internet resources. The work consists of three parts: first, extracting language-identifier features from URLs and selecting, from a massive collection of URLs, the web pages that have multilingual versions; second, crawling all of the text in these parallel web pages and cleaning it carefully; third, aligning the web page texts using cross-lingual sentence embeddings to obtain the final parallel corpus. The proposed alignment method can mine parallel corpora for multiple languages simultaneously and extends easily to additional languages. Using the Common Crawl dataset as the mining source, the system obtains nearly 10 million high-quality parallel corpus entries in 16 languages. On machine translation models, the mined corpus performs comparably to the open-source OPUS parallel corpus, so it can serve as additional data to improve translation quality.
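The first and third steps described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the set of language codes, the rule that the identifier appears as a path segment, the mutual-nearest-neighbour criterion, and the 0.8 similarity threshold are all assumptions made for illustration, and the sentence vectors are assumed to come from some cross-lingual encoder that is not shown.

```python
import math
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical identifier set; the abstract does not list the codes it uses.
LANG_CODES = {"en", "zh", "fr", "de", "es", "ja"}


def split_language(url):
    """Return (language, language-neutral key), or None if no identifier is found.

    Assumption: the language identifier is a whole path segment, e.g. /en/about.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i, seg in enumerate(segments):
        if seg.lower() in LANG_CODES:
            neutral = segments[:i] + ["*"] + segments[i + 1:]
            return seg.lower(), parts.netloc + "/".join(neutral)
    return None


def group_parallel_pages(urls):
    """Group URLs that differ only in the language segment; keep groups with 2+ languages."""
    groups = defaultdict(dict)
    for url in urls:
        found = split_language(url)
        if found:
            lang, key = found
            groups[key][lang] = url
    return {key: langs for key, langs in groups.items() if len(langs) >= 2}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, threshold=0.8):
    """Pair sentences that are each other's nearest neighbour above a threshold.

    Returns a list of (source_index, target_index, similarity) tuples.
    """
    if not src_vecs or not tgt_vecs:
        return []
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    pairs = []
    for i, row in enumerate(sims):
        j = max(range(len(row)), key=row.__getitem__)            # best target for source i
        back = max(range(len(sims)), key=lambda k: sims[k][j])   # best source for target j
        if back == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs
```

Because the alignment operates only on vectors in a shared embedding space, the same function applies to any language pair, which is consistent with the abstract's claim that the method extends to more languages without language-specific changes.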
Keywords/Search Tags: parallel corpus, machine translation, sentence alignment, cross-lingual sentence embedding