Research On Construction And Application Of English-Chinese Comparable Corpora

Posted on: 2012-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: L Fang
GTID: 2218330368493197
Subject: Computer application technology
Abstract/Summary:
Bilingual parallel corpora are widely used in computational linguistics and natural language processing: they provide essential training data for statistical machine translation and are also used in lexicography and cross-language information retrieval. However, acquiring a large-scale bilingual parallel corpus is not easy, and well-aligned, high-quality parallel corpora remain a scarce resource. Although some researchers have proposed effective methods for automatically mining bilingual parallel corpora from the Web, it is difficult to obtain large-scale, high-quality parallel corpora in real applications because of the complexity and diversity of Web information. Parallel corpora are also limited in scale and diversity and cannot handle the new-word problem well, so many researchers have turned to comparable corpora. Compared with parallel corpora, comparable corpora are more abundant, more up to date, and more accessible. The construction and application of comparable corpora have therefore received increasing research attention.
This thesis studies a Web-based method for building an English-Chinese comparable corpus, and the application of that corpus to the extraction of translation equivalents and to cross-language information retrieval. Before building the comparable corpus, we first study how to obtain large-scale bilingual texts from the Internet, and we propose a method for creating domain-specific collections from news sites. This work lays the basis for constructing a comparable corpus.
After acquiring the large-scale bilingual texts, we use cross-language information retrieval to retrieve similar documents from the target-language document repository and create a mapping between source and target documents, that is, between English and Chinese documents. This mapping yields the English-Chinese comparable corpus.
For the applications, we first extract translation equivalents from the comparable corpus with a context-vector-based approach, demonstrate the effectiveness of context-vector extraction, and compare the performance of different context-vector methods. The extracted translation equivalents are then applied to cross-language information retrieval and compared with a dictionary-based approach and a parallel-corpus-based approach. The experiments show that corpus-based query translation outperforms the dictionary-based approach, and that the comparable-corpus-based approach outperforms the parallel-corpus-based approach.
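The abstract does not give the details of the context-vector method, so the following is only a minimal sketch of the general technique as it is commonly described: build co-occurrence vectors for words in each language, project the source-language vector into the target language through a seed dictionary, and rank target words by cosine similarity. The window size, the raw-count weighting, the function names, the toy sentences, and the seed dictionary are all assumptions made for illustration and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """For every word, count the words that co-occur within `window` tokens."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def project(vector, seed_dict):
    """Map a source-language context vector onto target-language dimensions."""
    projected = Counter()
    for word, count in vector.items():
        for translation in seed_dict.get(word, ()):
            projected[translation] += count
    return projected


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(source_word, src_vectors, tgt_vectors, seed_dict, top_n=3):
    """Return the top_n target words whose contexts best match source_word."""
    projected = project(src_vectors[source_word], seed_dict)
    scored = [(cosine(projected, vec), cand) for cand, vec in tgt_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(cand, score) for score, cand in scored[:top_n]]


# Toy, pre-tokenized comparable data (hypothetical, for illustration only).
english = [
    ["the", "bank", "raised", "interest", "rates"],
    ["the", "bank", "cut", "interest", "rates"],
    ["the", "team", "won", "the", "match"],
]
chinese = [
    ["银行", "提高", "利率"],
    ["银行", "降低", "利率"],
    ["球队", "赢得", "比赛"],
]
# Hypothetical seed dictionary; "bank" is deliberately missing from it.
seed = {
    "interest": ["利率"], "rates": ["利率"], "raised": ["提高"],
    "cut": ["降低"], "team": ["球队"], "won": ["赢得"], "match": ["比赛"],
}

best = rank_candidates("bank", context_vectors(english), context_vectors(chinese), seed)
print(best)
```

On this toy data the top-ranked candidate for "bank" is 银行, because its Chinese context (提高, 降低, 利率) matches the projected English context. A realistic setup would likely replace raw counts with an association measure and use a much larger seed dictionary; those choices are exactly the kind of variation that a comparison of different context-vector methods would examine.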
Keywords/Search Tags: Bilingual Resources, Comparable Corpora, Translation Equivalents, Cross-Language Information Retrieval