Aligning Sentences From Bilingual Comparable Corpus Based On Wikipedia

Posted on:2014-01-07

Degree:Master

Type:Thesis

Country:China

Candidate:H S Hu

Full Text:PDF

GTID:2248330392960917

Subject:Computer application technology

Abstract/Summary:

Sentence alignment is the task of making explicit the correspondences between the sentences of two texts that are known to be mutual translations. Parallel corpora obtained through sentence alignment are widely used in natural language processing fields such as machine translation (MT) and paraphrase generation. With the rapid growth of the Internet in recent years, a large quantity of online articles in different languages convey similar meanings, and making use of these resources has become a hot topic in NLP research.

Comparable corpora are a new kind of corpus. They are easy to retrieve, large in volume, and high in noise, and they come without manual alignment. The multilingual online articles mentioned above are mostly comparable corpora. However, aligning and processing comparable corpora is a difficult task. Without manual alignment, alignment at the article level already introduces considerable noise, so the performance of sentence alignment cannot be guaranteed.

We propose a method that aligns sentences from comparable corpora based on Wikipedia. Wikipedia is a large-scale multilingual wiki platform with rich semantic information, and it is already aligned at the document level, so we build our sentence alignment work on its resources. First, we retrieve the English and Chinese Wikipedia data dumps and, after processing, reconstruct a local Wikipedia corpus database. A bilingual named-entity dictionary is extracted from Wikipedia entries during this process. Second, we extract candidate parallel sentence pairs from aligned document pairs and analyze the candidates to discover the characteristics of the Wikipedia corpus, so that we can design and select features for the classifier. We demonstrate the necessity of using ternary classification (alignment / partial alignment / non-alignment) instead of binary classification. Finally, we explain the steps of sentence alignment in detail, show how SVM and ME (maximum entropy) classifiers are used to extract parallel pairs from the candidates, and analyze the performance of our methods.

We apply our method to both the Wikipedia corpus and a third-party parallel corpus. The results show that alignment precision reaches 0.82 on the comparable corpus and 0.92 on the parallel corpus.
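
The candidate-extraction and feature-design steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the two features (a length ratio and a named-entity dictionary overlap), and the toy dictionary are hypothetical and are not taken from the thesis, which does not list its features here. The resulting feature vectors would then be passed to a ternary classifier such as an SVM or maximum entropy model, which is not shown.

```python
from itertools import product

# Ternary label set named in the abstract.
LABELS = ("non-alignment", "partial alignment", "alignment")


def candidate_pairs(en_sentences, zh_sentences):
    """Pair every English sentence with every Chinese sentence
    of one document pair that is already aligned at document level."""
    return list(product(en_sentences, zh_sentences))


def length_ratio(en, zh):
    """Hypothetical feature: ratio of the shorter to the longer length,
    counting English whitespace tokens and Chinese characters."""
    en_len = len(en.split())
    zh_len = len(zh)
    if en_len == 0 or zh_len == 0:
        return 0.0
    return min(en_len, zh_len) / max(en_len, zh_len)


def dictionary_overlap(en, zh, ne_dict):
    """Hypothetical feature: of the dictionary entities found in the
    English sentence, the fraction whose translation occurs in the
    Chinese sentence."""
    en_lower = en.lower()
    hits = [entity for entity in ne_dict if entity.lower() in en_lower]
    if not hits:
        return 0.0
    matched = sum(1 for entity in hits if ne_dict[entity] in zh)
    return matched / len(hits)


def features(en, zh, ne_dict):
    """Feature vector for one candidate pair (input to a classifier)."""
    return [length_ratio(en, zh), dictionary_overlap(en, zh, ne_dict)]


# Toy bilingual named-entity dictionary (illustrative entries only).
ne_dict = {"Shanghai": "上海", "China": "中国"}
en_doc = ["Shanghai is a city in China."]
zh_doc = ["上海是中国的一座城市。", "北京是中国的首都。"]

for en, zh in candidate_pairs(en_doc, zh_doc):
    print(en, "|", zh, "|", features(en, zh, ne_dict))
```

In this toy run, the first Chinese sentence contains the translations of both entities found in the English sentence, while the second contains only one of them, so the overlap feature separates a likely alignment from a partial match.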

Keywords/Search Tags:

Sentence Alignment, Comparable Corpus, Wikipedia, SVM, Max Entropy

Related items

1	Research On Sentence Alignment Method Based On Cross-lingual Word Embeddings
2	Research On The Method Of Constructing Chinese And Vietnamese Comparable Corpus Based On
3	The Research Of Sentence Alignment In Chinese-Uighur Bilingual Corpus
4	Design And Implementation. IHSMTS Chinese-English Bilingual Sentence Alignment Mechanism
5	Building And Evaluating Special Domain Comparable Corpus
6	Research On The Automatic Construction Of Chinese-Japanese Parallel Corpus
7	The Design And Implementation Of Uyghur-Chinese Parallel Corpus Processing System
8	Chinese Uygur Kazak Kirgiz Bilingual Corpus Processing System Design And Implementation
9	Research Of Bilingual Sentence Alignment Served The Chinese-Uyghur Machine Translation System
10	Han Lao Double Sentence Alignment Method Research