The Research And Construction Of Comparable Corpora

Posted on: 2010-03-14, Degree: Master, Type: Thesis
Country: China, Candidate: H T Yu, Full Text: PDF
GTID: 2178360302460675, Subject: Computer application technology
Abstract/Summary:
Parallel corpora are widely used in computational linguistics and natural language processing. However, existing parallel materials consist mainly of parliamentary debates and legal texts, and well-aligned, high-quality parallel materials remain scarce. Although some researchers have tried to collect parallel corpora automatically from the web, obtaining large-scale, high-quality parallel corpora is difficult because web formats are diverse and web content is unconstrained. As a result, current parallel corpora fall short in scale, freshness, and domain balance.

To address these limitations, researchers in computational linguistics and natural language processing have turned to comparable corpora. Because the sources for comparable corpora are far more abundant, comparable corpora avoid many of the limitations of parallel corpora. To our knowledge, no research on the construction of comparable corpora has yet been published in China. This dissertation is based on the project "Mining English-Chinese named entity pairs based on multi-feature integrated models from comparable corpora", supported by Microsoft Corporation and the National High Technology Project of China's 863 Program. Our goal is to construct Chinese-English comparable corpora and to explore and address the problems that arise during their construction.

Building on earlier related work, we propose our own method for constructing comparable corpora. The method has two parts:

(1) Incremental crawling is used to harvest and update the local repository of web documents, which is the source for the comparable corpora, so that the corpora stay fresh. This is one of the innovations of our work: unlike earlier construction efforts, we account for the way web documents change over time, which keeps our comparable corpora fresher.

(2) Cross-language information retrieval (Chinese to English in our work) is used to retrieve similar documents from the target-language document repository and build a pool of relevant documents. An alignment process then creates a mapping between source and target documents, which yields the comparable corpora.

During construction, we also propose a method for resolving the Chinese out-of-vocabulary (OOV) problem, which strongly affects the performance of cross-language information retrieval and therefore the quality of document alignment in our work. The method first analyzes the translation feature of an OOV term, that is, it identifies whether the term is a translation OOV, a transliteration OOV, or a mixed-translation OOV. The identification result is then used in the subsequent steps of candidate translation extraction and selection. This helps when the translation model and the transliteration model are used in combination: the two models are weighted differently according to the OOV term's translation feature, which gives more precise results.

Experimental results show that our method for constructing comparable corpora is effective.
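The type-dependent weighting of the translation and transliteration models described in the abstract might look roughly like the following sketch. It is only an illustration under stated assumptions: the abstract does not give the actual weights, scoring functions, or data structures, so the names `OOVType`, `WEIGHTS`, `combined_score`, and `select_candidate`, the linear combination, and all numeric values are hypothetical.

```python
from enum import Enum


class OOVType(Enum):
    """The three OOV categories named in the abstract."""
    TRANSLATION = "translation"
    TRANSLITERATION = "transliteration"
    MIXED = "mixed"


# Hypothetical (translation weight, transliteration weight) per OOV type.
# These values are assumptions for illustration, not taken from the thesis.
WEIGHTS = {
    OOVType.TRANSLATION: (0.8, 0.2),
    OOVType.TRANSLITERATION: (0.2, 0.8),
    OOVType.MIXED: (0.5, 0.5),
}


def combined_score(oov_type, translation_score, transliteration_score):
    """Linearly combine the two model scores using type-dependent weights."""
    w_translation, w_transliteration = WEIGHTS[oov_type]
    return (w_translation * translation_score
            + w_transliteration * transliteration_score)


def select_candidate(oov_type, candidates):
    """Return the candidate with the highest combined score.

    `candidates` maps each candidate string to a pair
    (translation_score, transliteration_score).
    """
    return max(
        candidates,
        key=lambda c: combined_score(oov_type, *candidates[c]),
    )
```

With this scheme, the same candidate list can yield different winners depending on how the OOV term was classified, which is the effect the abstract attributes to the identification step.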
Keywords/Search Tags: Parallel Corpora, Comparable Corpora, Incremental Crawl, Cross-language Information Retrieval, Out-Of-Vocabulary Term