
Aligning Sentences From Bilingual Comparable Corpus Based On Wikipedia

Posted on: 2014-01-07  Degree: Master  Type: Thesis
Country: China  Candidate: H S Hu  Full Text: PDF
GTID: 2248330392960917  Subject: Computer application technology
Abstract/Summary:
Sentence alignment is the task of making explicit the relations that exist between two texts known to be mutual translations. Parallel corpora obtained by sentence alignment are widely used in natural language processing fields such as machine translation (MT) and paraphrase generation. With the rapid growth of the Internet in recent years, a large quantity of online articles in different languages convey similar meanings, and making use of these resources has become an active topic in NLP research.

Comparable corpora are a newer kind of corpus. They are easy to retrieve and large in volume, and they do not require manual alignment, but they are also noisy. The multilingual online articles described above are mostly comparable corpora. However, aligning and processing comparable corpora remains a difficult task. Without manual alignment, alignment at the article level already introduces considerable noise, so the performance of sentence alignment cannot be assured.

We propose a method that aligns sentences from a comparable corpus based on Wikipedia. Wikipedia is a large-scale multilingual wiki platform with rich semantic information, and it is already aligned at the document level, so we build our sentence alignment work on its resources. First, we retrieve the English and Chinese Wikipedia data dumps and, after processing, reconstruct a local Wikipedia corpus database. A bilingual named-entity dictionary is extracted from Wikipedia entries during this process. Second, we extract candidate parallel sentence pairs from aligned document pairs and analyze the candidates to discover the characteristics of the Wikipedia corpus, so that we can design and select features for the classifier. We demonstrate the necessity of using ternary classification (alignment, partial alignment, non-alignment) instead of binary classification. Finally, we explain the steps of sentence alignment in detail, show how SVM and maximum entropy (ME) classifiers are used to extract parallel pairs from the candidates, and analyze the performance of our method.

We apply our method to both the Wikipedia corpus and a third-party parallel corpus. The results show that alignment precision reaches 0.82 for the comparable corpus and 0.92 for the parallel corpus.
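
As an illustration of the candidate-extraction and feature-design steps described in the abstract, the following is a minimal Python sketch. It is not the thesis implementation: the function names `candidate_pairs` and `pair_features`, the two features (length ratio and named-entity overlap), and the exhaustive pairing of sentences are all assumptions chosen for illustration. The abstract does not state which features or candidate filters were actually used.

```python
from itertools import product


def candidate_pairs(en_sentences, zh_sentences):
    """Pair every English sentence with every Chinese sentence
    from one document pair already aligned at the document level.
    (Exhaustive pairing is an assumption, not the thesis's stated filter.)"""
    return list(product(en_sentences, zh_sentences))


def pair_features(en_sentence, zh_sentence, entity_dict):
    """Compute two hypothetical features for one candidate pair.

    entity_dict maps an English named entity to its Chinese form,
    standing in for the bilingual named-entity dictionary.
    """
    en_len = len(en_sentence.split())  # English length in whitespace tokens
    zh_len = len(zh_sentence)          # Chinese length in characters
    if en_len and zh_len:
        length_ratio = min(en_len, zh_len) / max(en_len, zh_len)
    else:
        length_ratio = 0.0

    # Entities from the dictionary that occur in the English sentence.
    en_lower = en_sentence.lower()
    present = [en for en in entity_dict if en.lower() in en_lower]
    # Of those, the ones whose Chinese form occurs in the Chinese sentence.
    matched = [en for en in present if entity_dict[en] in zh_sentence]
    entity_overlap = len(matched) / len(present) if present else 0.0

    return {"length_ratio": length_ratio, "entity_overlap": entity_overlap}
```

Feature dictionaries of this kind could then be passed to a three-class classifier (alignment, partial alignment, non-alignment), such as the SVM or ME classifiers named in the abstract.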
Keywords/Search Tags:Sentence Alignment, Comparable Corpus, Wikipedia, SVM, Max Entropy