Exploiting comparable corpora

Posted on:2007-08-26

Degree:Ph.D

Type:Thesis

University:University of Southern California

Candidate:Munteanu, Dragos Stefan

GTID:2448390005478257

Subject:Computer Science

Abstract/Summary:

One of the major bottlenecks in the development of Statistical Machine Translation systems for most language pairs is the lack of bilingual parallel training data. Currently available parallel corpora span relatively few language pairs and very few domains; building new ones of sufficiently large size and high quality is time-consuming and expensive.

In this thesis, I propose methods that enable the automatic creation of parallel corpora by exploiting a rich, diverse, and readily available resource: comparable corpora. Comparable corpora are bilingual texts that, while not parallel in the strict sense, are somewhat related and convey overlapping information. Such texts exist in large quantities on the Web; good examples are the multilingual news feeds produced by news agencies such as Agence France-Presse, CNN, and BBC.

I present novel methods for extracting parallel data of good quality from such comparable collections. I show how to detect parallelism at various granularity levels, and thus find parallel documents (if there are any in the collection), parallel sentences, and parallel sub-sentential fragments. To demonstrate the validity of this approach, I use my methods to extract data from large-scale comparable corpora for various language pairs, and show that the extracted data helps improve the end-to-end performance of a state-of-the-art machine translation system.

Keywords/Search Tags:

Language pairs, Comparable corpora, Parallel, Data

Related items

1	The Research And Construction Of Comparable Corpora
2	Research On Key Technology In Mining Web Bilingual Corpora
3	Creating Chinese-English Comparable Corpora
4	Research On Construction And Application Of English-Chinese Comparable Corpora
5	The Construction Of Large-scale Chinese-English Comparable Corpora
6	Research On Chinese-Thai Bilingual Corpus Mining Method For Internet News
7	Improving statistical machine translation using comparable corpora
8	Research On Extraction Of Bilingual Multi-word Term Translation Pairs From Comparable Corpora
9	Research On Term Extraction Method Based On Comparable Corpora
10	Mining Chinese-English Named Entity Pairs From Comparable Corpora