
A Study On The Key Technologies Of Web-Based Indonesian-Chinese Parallel Corpus Construction

Posted on: 2022-03-19  Degree: Master  Type: Thesis
Country: China  Candidate: S S Li
GTID: 2518306530966729  Subject: Management Science and Engineering
Abstract/Summary:
With the continuing advance of globalization and the rapid development of the Internet, cross-language natural language processing plays an important role in removing language barriers and achieving interconnection. Meanwhile, more and more media websites publish information bilingually or multilingually, which makes it possible to obtain large-scale bilingual corpus resources from the Internet. As a basic resource for cross-language natural language processing, a parallel corpus contains rich bilingual knowledge. The construction of Web-based parallel corpora has therefore become an important research topic in natural language processing. Current research focuses mainly on bilingual parallel web page mining, sentence alignment and word alignment.

However, as a low-resource language, Indonesian has received far less attention in natural language processing than English, Chinese and other widely studied languages. There is currently little research on Indonesian-Chinese parallel corpora at home or abroad: no large-scale Indonesian-Chinese parallel corpus is available, and few cutting-edge methods and models have been applied to the problem. In response, this thesis carries out the following three pieces of work, applying more recent methods to the key technologies of Indonesian-Chinese parallel corpus construction and providing data support for related cross-language information processing research.

First, the construction of a Web-based parallel corpus. We collected a set of Indonesian-Chinese bilingual websites through manual search and proposed a method for obtaining bilingual parallel web pages based on a mixture of URL similarity and HTML structure similarity; manual review confirmed the effectiveness of this method. Web page cleaning was then performed to obtain a text-level Indonesian-Chinese parallel corpus. Finally, 449,972 Indonesian-Chinese parallel sentence pairs were obtained using web rules and manual review, which provides a solid data basis for the subsequent research.

Second, the study of sentence alignment based on deep learning. We explored how to achieve sentence alignment using only bilingual parallel sentence pairs, reducing the dependence on an Indonesian-Chinese bilingual dictionary. We implemented a sentence alignment model based on a Bidirectional Long Short-Term Memory neural network. To make up for its shortcoming of focusing only on global features during feature extraction, we also implemented a Bidirectional Long Short-Term Memory neural network with an attention mechanism. The third deep learning sentence alignment model was obtained by fine-tuning the BERT model. In the experiments, a sentence alignment model based on sentence length served as the baseline. The results showed that the deep learning methods perform significantly better than the baseline. The best model is the Bidirectional Long Short-Term Memory neural network with the attention mechanism, which reaches an accuracy of 98.20%.

Third, research on word alignment based on deep learning. Because large-scale Indonesian-Chinese bilingual dictionaries are lacking, we explored the applicability of word alignment methods that do not rely on bilingual dictionaries to the task of Indonesian-Chinese word alignment. Multilingual BERT was used to train context vectors for the source language and the target language separately, yielding a word alignment model based on context vectors; we also applied an adversarial neural network to learn the mapping matrix between the source and target vector spaces to achieve unsupervised word alignment. The results showed that the best model is the word alignment model based on context vectors, with an accuracy of 88.04%, which reflects the strong language representation capabilities of Multilingual BERT and the effectiveness of the bidirectional Transformer for in-depth context analysis.
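
The abstract does not give the formulas behind the page-pairing step, so the following is only a minimal sketch of how a mixture of URL similarity and HTML structure similarity could be scored. Everything in it is an assumption for illustration rather than the thesis's method: the language-marker list, the use of `difflib.SequenceMatcher` for both similarities, the equal default `weight`, and the function names.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumed (hypothetical) language markers that differ between the two versions of a page.
LANG_TOKENS = re.compile(r"(?<![a-z])(id|zh|cn|indonesian|chinese)(?![a-z])")


def normalize_url(url):
    """Lower-case the URL and replace language markers with a placeholder."""
    return LANG_TOKENS.sub("*", url.lower())


def url_similarity(url_a, url_b):
    """Character-level similarity of the two URLs after masking language markers."""
    return SequenceMatcher(None, normalize_url(url_a), normalize_url(url_b)).ratio()


class TagCollector(HTMLParser):
    """Collects the sequence of opening tag names, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


def structure_similarity(html_a, html_b):
    """Similarity of the two pages' opening-tag sequences."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()


def pair_score(url_a, html_a, url_b, html_b, weight=0.5):
    """Weighted mixture of URL and structure similarity; weight is an assumed parameter."""
    return (weight * url_similarity(url_a, url_b)
            + (1.0 - weight) * structure_similarity(html_a, html_b))


if __name__ == "__main__":
    id_page = ("http://example.com/id/news/123.html",
               "<html><body><p>a</p><p>b</p></body></html>")
    zh_page = ("http://example.com/zh/news/123.html",
               "<html><body><p>x</p><p>y</p></body></html>")
    print(pair_score(*id_page, *zh_page))
```

In a pipeline of this kind, candidate pairs scoring above a chosen threshold would be passed on to manual review; the threshold itself would have to be tuned on reviewed data.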
Keywords/Search Tags: Indonesian-Chinese parallel corpus, bilingual parallel web page mining, sentence alignment, word alignment, deep learning