Research On Acquiring Bilingual Parallel Sentences And Building Corpus

Posted on: 2014-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: F Liu
GTID: 2268330422950598
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the explosive growth of information on the Internet has provided new opportunities for statistical machine translation, and many research groups and commercial institutions have begun to obtain bilingual corpora from the Internet on a large scale.

The performance of a machine translation model depends closely on the scale and quality of its corpus. Data acquired from the Internet is far larger in scale than data collected manually, but ensuring the quality of the corpus and building corpora suitable for machine translation has become a priority issue.

In the field of machine translation, sentence alignment methods based on the fusion of multiple sentence-level features have become mature and give good results on corpora of good quality. For the evaluation of machine translation systems there are a variety of common techniques, among which BLEU has been the main one adopted by scholars since its invention. BLEU computes its score from the similarity between a human translation, used as the reference, and the output of machine translation. In the field of text classification, different feature extraction methods and classification algorithms are available for different situations.

This thesis focuses on these three aspects. First, for the sentence alignment of bilingual parallel corpora, we adopt a multi-feature fusion sentence alignment technique that borrows from TF-IDF. We made several alterations so that it can adapt to the noise of Internet language: we added a paragraph alignment module, segmented long texts, and aligned noisy texts head to tail to solve or weaken the noise problem. Second, for the quality evaluation of corpora, we replaced the human translation with the output of an online translation system as the reference and used the BLEU algorithm to obtain a quality score. Because this method is more efficient than relying on human translation, it can be applied to large-scale corpora. Third, for field demarcation, we tried a variety of feature extraction methods and classification algorithms and found a combination that is better suited to Internet corpora.

The attempts on these three aspects have achieved good results to some extent. The experimental results demonstrate the usability of the technologies we applied when building a bilingual parallel corpus, namely the alignment method, the quality evaluation method and the field demarcation method.
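The quality evaluation step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `ngrams` and `bleu`, the whitespace tokenization, the maximum n-gram order of 4 and the add-one smoothing are all assumptions made for illustration. The sketch scores the target side of a candidate sentence pair against a machine translation of its source side, which stands in for the human reference.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference.

    Assumptions (not from the thesis): whitespace tokenization and
    add-one smoothing so that short sentences do not score zero.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: each candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)


if __name__ == "__main__":
    # Hypothetical data: the target side of a mined pair and the output of
    # an online translation system for the source side.
    target_side = "the weather is very good today"
    mt_reference = "the weather is very nice today"
    print(round(bleu(target_side, mt_reference), 3))
```

In such a setup, pairs whose score falls below a chosen threshold would be treated as low quality; the threshold itself would have to be tuned on the corpus at hand.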
Keywords/Search Tags: Parallel corpus, Bilingual alignment, Quality evaluation, Field demarcation