
Building And Evaluating Special Domain Comparable Corpus

Posted on: 2013-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: S Liu
Full Text: PDF
GTID: 2218330371960216
Subject: Information Science
Abstract/Summary:
Bilingual dictionaries and parallel corpora play an important role in multilingual information processing tasks such as machine translation and cross-lingual information retrieval. However, these resources are scarce and difficult to collect for under-resourced languages and specialized domains. By contrast, a comparable corpus is easier to obtain, because it requires only multilingual text collections on similar topics rather than collections that are translations of each other. It is therefore worthwhile to study how domain-specific comparable corpora can be constructed and evaluated: such work enriches the existing theoretical framework and can also provide large-scale, high-quality corpus resources.

This thesis first collects comparable corpora from different web data sources. It then measures corpus comparability based on cross-language similarity and the consistency of subject distribution. Finally, it evaluates the quality of the resulting corpora through both internal and external evaluation.

Three Internet data sources are used to collect the corpora. The first is querying a search engine with bilingual domain keywords. The second is exploiting the online encyclopedia Wikipedia. The third is acquiring domain-specific Chinese-English text from academic databases.

Three kinds of measures (traditional statistical, frequency-based, and termhood-based) are used to measure corpus similarity. The results show that the termhood-based measure performs best and the traditional statistical measure performs worst.

Corpus evaluation is carried out from two sides, internal and external. Internally, word-level comparisons using descriptive statistics and the similarity between sub-corpora are used to assess consistency directly. Externally, bilingual terminology extraction from the comparable corpus is used to assess corpus quality indirectly.
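
The idea of measuring comparability through cross-language similarity can be illustrated with a small sketch. The Python below shows one simple dictionary-based formulation: the share of dictionary-covered vocabulary whose translation appears in the other corpus. It is an assumption made for illustration, not the metric used in the thesis, and the function name `comparability`, the toy token lists, and the toy bilingual dictionaries are all hypothetical.

```python
def comparability(src_tokens, tgt_tokens, src_to_tgt, tgt_to_src):
    """Illustrative dictionary-based comparability score in [0, 1].

    src_to_tgt / tgt_to_src map a word to the set of its translations.
    Only words covered by the dictionaries are counted.
    """
    src_all, tgt_all = set(src_tokens), set(tgt_tokens)

    # Vocabulary of each corpus that the bilingual dictionary covers.
    src_vocab = {w for w in src_all if w in src_to_tgt}
    tgt_vocab = {w for w in tgt_all if w in tgt_to_src}

    total = len(src_vocab) + len(tgt_vocab)
    if total == 0:
        return 0.0  # nothing to compare

    # A word counts as a hit if any of its translations occurs
    # in the corpus of the other language.
    src_hits = sum(1 for w in src_vocab if src_to_tgt[w] & tgt_all)
    tgt_hits = sum(1 for w in tgt_vocab if tgt_to_src[w] & src_all)
    return (src_hits + tgt_hits) / total


# Toy Chinese-English data (hypothetical, for illustration only).
zh_tokens = ["语料", "翻译", "检索"]
en_tokens = ["corpus", "translation", "weather"]
zh_en = {"语料": {"corpus"}, "翻译": {"translation"}, "检索": {"retrieval"}}
en_zh = {"corpus": {"语料"}, "translation": {"翻译"}, "weather": {"天气"}}

score = comparability(zh_tokens, en_tokens, zh_en, en_zh)
print(f"comparability = {score:.3f}")
```

In this toy example, four of the six dictionary-covered words have a translation in the other corpus. A frequency-based or termhood-based variant would weight each word (for example by its count or by a term-likeness score) instead of counting every word equally.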
Keywords/Search Tags: Domain Comparable Corpus Construction, Comparable Corpus Construction, Comparability Metrics, Corpus Evaluation